[SOUND]
Hello.
Welcome to the course Text Mining and
Analytics.
My name is ChengXiang Zhai.
I have a nickname, Cheng.
I am a professor of the Department of
Computer Science at the University of
Illinois at Urbana-Champaign.
This course is a part of
a data mining specialization
offered by the University of
Illinois at Urbana-Champaign.
In addition to this course,
there are four other courses offered by
Professor Jiawei Han,
Professor John Hart and me, followed by
a capstone project course that
all of us will teach together.
This course is particularly related to
another course in the specialization,
namely Text Retrieval and Search Engines,
in that both courses are about text data.
In contrast, pattern discovery and
cluster analysis are about
algorithms more applicable to
all kinds of data in general.
The visualization course is also
relatively general in that the techniques
can be applied to all kinds of data.
This course addresses a pressing need for
harnessing big text data.
Text data has been growing
dramatically recently,
mostly because of the advance of
technologies deployed on the web
that would enable people to
quickly generate text data.
So, I listed some of
the examples on this slide
that can show a variety of text
data that are available today.
For example, if you think about
the data on the internet, on the web,
every day we are seeing many
web pages being created.
Blogs are another kind
of new text data that
are being generated quickly by people.
Anyone can write a blog
article on the web.
News articles, of course, have always been
a main kind of text data that
is being generated every day.
Emails are yet another kind of text data.
And literature is also representing
a large portion of text data.
It's also especially very important
because of the high quality
in the data.
That is,
we encode our knowledge about the world
using text data represented by
all the literature articles.
It's a vast amount of knowledge of
all the text and
data in these literature articles.
Twitter is another representative
text data representing social media.
Of course there are forums as well.
People are generating tweets very quickly
indeed as we are speaking perhaps many
people have already written many tweets.
So, as you can see there
are all kinds of text data
that are being generated very quickly.
Now these text data present
some challenges for people.
It's very hard for anyone to
digest all the text data quickly.
In particular, it's impossible for
scientists to read all of the literature, for
example, or for
anyone to read all the tweets.
So there's a need for tools to help
people digest text data more efficiently.
There is also another
interesting opportunity
provided by such big text data, and
that is it's possible to leverage
the amount of text data to
discover interesting patterns to
turn text data into actionable knowledge
that can be useful for decision making.
So for example,
product managers may be interested
in knowing the feedback of
customers about their products,
knowing how well their
products are being received as
compared with the products of competitors.
This can be a good opportunity for
leveraging text data as we have seen
a lot of reviews of product on the web.
So if we can develop and master text
mining techniques to tap into such
a [INAUDIBLE] to extract the knowledge and
opinions of people about these products,
then we can help these product managers
to gain business intelligence or,
essentially, feedback
from their customers.
In scientific research, for example,
scientists are interested in knowing
the trends of research topics, knowing
about what related fields have discovered.
This problem is especially important
in biology research as well.
Different communities tend to
use different terminologies, yet
they're studying very similar problems.
So how can we integrate the knowledge
that is covered in different communities
to help study a particular problem?
It's very important, and
it can speed up scientific discovery.
So there are many such examples
where we can leverage the text data
to discover useable knowledge
to optimize our decision.
The main techniques for
harnessing big text data are text
retrieval and text mining.
So these are two very much
related technologies. Yet,
they have somewhat different purposes.
These two kinds of techniques are covered
in the two courses in this specialization.
So, Text Retrieval and Search
Engines covers text retrieval,
and this is necessary to
turn big text data into
a much smaller but more relevant text
data, which are often the data that
we need to handle a particular problem or
to optimize a particular decision.
This course covers text mining which
is a second step in this pipeline
that can be used to further process
the small amount of relevant data
to extract the knowledge or to help
people digest the text data easily.
So the two courses are clearly related,
in fact,
some of the techniques are shared by
both text retrieval and text mining.
If you have already taken the text
retrieval course, then you might see
some of the content being repeated
in this text mining course, although
we'll be talking about the techniques
from a very different perspective.
If you have not taken
the text retrieval course,
it's also fine because this
course is self-contained and
you can certainly understand all of
the materials without a problem.
Of course, you might find it
beneficial to take both courses and
that will give you a very complete set
of skills to handle big text data.
[MUSIC]

[SOUND]
This
lecture is a brief
introduction to the course.
We're going to cover the objectives
of the course, the prerequisites and
course formats, reference books and
how to complete the course.
The objectives of the course
are the following.
First, we would like to
cover the basic concepts and
practical techniques of text data mining.
So this means we will not be able to
cover some advanced techniques in detail,
but rather we choose
the practically useful
techniques and then treat them in more depth.
We're going to also cover the basic
concepts that are very useful for
many applications.
The second objective is to cover
more general techniques for
text data mining, so
we emphasize the coverage of general
techniques that can be applicable to
any text in any natural language.
We also hope that these
techniques can either
automatically work on problems
without any human effort or
require only minimal human effort.
So these criteria have
helped us to choose
techniques that can be
applied to many applications.
This is in contrast to some more
detailed analysis of text data,
particularly using natural
language processing techniques.
Now such techniques
are also very important.
And they are indeed, necessary for
some of the applications,
where we would like to go in depth to
understand text data in more detail.
Such detailed understanding techniques,
however,
are generally not scalable and they
tend to require a lot of human effort.
So they cannot be easily
applied to any domain.
So as you can imagine in practice,
it would be beneficial to combine
both kinds of techniques using
the general techniques that we'll be
covering in this course as a basis and
improve these techniques by using more
human effort whenever it's appropriate.
We also would like to provide a hands-on
experience to you in multiple aspects.
First, you'll do some experiments
using a text mining toolkit and
implementing text mining algorithms.
Second, you will have an opportunity to
experiment with some algorithms for
text mining and
analytics, to try them on some datasets, and
to understand how to do experiments.
And finally, you will have an opportunity
to participate in a competition
on a text-based prediction task.
You're expected to know the basic
concepts of computer science.
For example, the data structures and
some other really basic
concepts in computer science.
You are also expected to be
familiar with programming and
comfortable with programming,
particularly with C++.
This course,
however is not about programming.
So you are not expected to
do a lot of coding, but
we're going to give you a C++ toolkit
that's fairly sophisticated.
So you have to be comfortable
with handling such a toolkit and
you may be asked to write
a small amount of code.
It's also useful if you
know some concepts and
techniques in probability and
statistics, but it's not necessary.
Having such knowledge would help you
understand some of the algorithms in
more depth.
The format of the course is lectures
plus quizzes that will be given to you
on a regular basis, and there are
also optional programming assignments.
Now, we've made programming
assignments optional,
not because they are not important, but
because we suspect that not
all of you will have the needed
computing resources to do
the programming assignments.
So naturally,
we would encourage all of you to try to do
the programming assignments,
if possible as that will be a great way
to learn about the knowledge
that we teach in this course.
There's no required reading for
this course,
but I will list some of
the useful reference books here.
So we expect you to be able to understand
all the essential materials by just
watching the lecture videos and
you should be able to answer all the quiz
questions by just watching the videos.
But it's always good to read additional
books to enlarge the scope of knowledge,
so here are the four books.
The first is a textbook about
statistical language processing.
Some of the chapters [INAUDIBLE]
are especially relevant to this course.
The second one is a textbook
about information retrieval,
but it has broadly covered
a number of techniques that
are really in the category
of text mining techniques.
So it's also useful, because of that.
The third book is actually
a collection of survey articles and
it has broadly covered all
the aspects of mining text data.
The most relevant chapters
are also listed here.
In these chapters, you can find
some in depth discussion of cutting
edge research on the topics that
we discussed in this course.
And the last one is actually
a book that Sean Massung and
I are currently writing and
we're going to make the rough
draft chapters available at
this URL listed right here.
You can also find additional
reference books and
other readings at the URL
listed at the bottom.
So finally, some information about how
to complete the course. This
information is also on the web,
so I will just briefly go over it.
You can complete the course by
earning one of the following badges.
One is Course Achievement Badge.
To earn that,
you have to have at least a 70%
average score on all the quizzes combined.
It does not mean every quiz has to be 70% or
better.
The second badge here,
this is a Course Mastery Badge and
this just requires a higher score,
90% average score for the quizzes.
There are also three
optional programming badges.
I said earlier that we encourage you
to do programming assignments, but
they're not necessary,
they're not required.
The first is
Programming Achievement Badge.
This is similar to the Course
Achievement Badge.
Here we require you to get at least a 70%
average score on programming assignments.
And similarly, the mastery badge
is given to those who can score
90% average score or better.
The last badge is
a Text Mining Competition Leader Badge and
this is given to those of you who
do well in the competition task.
And specifically, we're planning to give
the badge to the top
30% on the leaderboard.
[MUSIC]

[SOUND]
In
this lecture we give an overview
of Text Mining and Analytics.
First, let's define the term text mining,
and the term text analytics.
The title of this course is
called Text Mining and Analytics.
But the two terms text mining, and text
analytics are actually roughly the same.
So we are not going to
really distinguish them, and
we're going to use them interchangeably.
But the reason that we have chosen to use
both terms in the title is because
there is also some subtle difference,
if you look at the two phrases literally.
Mining emphasizes more the process.
So it gives us an algorithmic
view of the problem.
Analytics, on the other hand
emphasizes more on the result,
or having a problem in mind.
We are going to look at text
data to help us solve a problem.
But again as I said, we can treat
these two terms roughly the same.
And I think in the literature
you probably will find the same.
So we're not going to really
distinguish that in the course.
Both text mining and
text analytics mean that we
want to turn text data into high quality
information, or actionable knowledge.
So in both cases, we
have the problem of dealing with
a lot of text data, and we hope to
turn these text data into something more
useful to us than the raw text data.
And here we distinguish
two different results.
One is high-quality information,
the other is actionable knowledge.
Sometimes the boundary between
the two is not so clear.
But I also want to say a little bit about
these two different angles of
the results of text mining.
In the case of high quality information,
we refer to more
concise information about the topic.
Which might be much easier for
humans to digest than the raw text data.
For example, you might face
a lot of reviews of a product.
A more concise form of information
would be a very concise summary
of the major opinions about
the features of the product.
Positive about,
let's say battery life of a laptop.
Now this kind of result is very useful
to help people digest the text data.
And so this is to minimize a human effort
in consuming text data in some sense.
The other kind of output
is actionable knowledge.
Here we emphasize the utility
of the information or
knowledge we discover from text data.
It's actionable knowledge for some
decision problem, or some actions to take.
For example, we might be able to determine
which product is more appealing to us,
or a better choice for
a shopping decision.
Now, such an outcome could be
called actionable knowledge,
because a consumer can take the knowledge
and make a decision, and act on it.
So, in this case text mining supplies
knowledge for optimal decision making.
But again, the two are not so
clearly distinguished, so
we don't necessarily have
to make a distinction.
Text mining is also
related to text retrieval,
which is an essential component
in many text mining systems.
Now, text retrieval refers to
finding relevant information from
a large amount of text data.
So I've taught another separate course
on text retrieval and search engines,
where we discussed various techniques for
text retrieval.
If you have taken that course,
you will find some overlap.
And it will be useful to know
the background of text retrieval
for understanding some of
the topics in text mining.
But if you have not taken that course,
it's also fine because in this course
on text mining and analytics, we're
going to repeat some of the key concepts
that are relevant for text mining.
But they're at a high level, and
we will also explain the relation between
text retrieval and text mining.
Text retrieval is very useful for
text mining in two ways.
First, text retrieval can be
a preprocessor for text mining.
Meaning that it can help
us turn big text data into
a relatively small amount
of most relevant text data.
Which is often what's needed for
solving a particular problem.
And in this sense, text retrieval
also helps minimize human effort.
Text retrieval is also needed for
knowledge provenance.
And this roughly corresponds
to the interpretation of text
mining as turning text data
into actionable knowledge.
Once we find the patterns in text data, or
actionable knowledge, we generally
would have to verify the knowledge
by looking at the original text data.
So the users would have to have some text
retrieval support, go back to the original
text data to interpret the pattern or
to better understand an analogy or
to verify whether a pattern
is really reliable.
So this is a high level introduction
to the concept of text mining,
and the relationship between
text mining and retrieval.
Next, let's talk about text
data as a special kind of data.
Now it's interesting to
view text data as data
generated by humans as subjective sensors.
So, this slide shows an analogy
between text data and non-text data.
And between humans as
subjective sensors and
physical sensors,
such as a network sensor or a thermometer.
So in general a sensor would
monitor the real world in some way.
It would sense some signal
from the real world, and
then would report the signal as data,
in various forms.
For example, a thermometer would watch
the temperature of the real world and
then would report the temperature
in a particular format.
Similarly, a geo sensor would sense
the location and then report
the location specification, for
example, in the form of a longitude
value and a latitude value.
A network sensor would
monitor network traffic,
or activities in the network, and
report them in
some digital format of data.
Similarly we can think of
humans as subjective sensors.
that observe the real world
from some perspective.
And then humans will express what they
have observed in the form of text data.
So, in this sense, human is actually
a subjective sensor that would also
sense what's happening in the world and
then express what's observed in the form
of data, in this case, text data.
Now, looking at the text data in
this way has an advantage of being
able to integrate all
types of data together.
And that's indeed needed in
most data mining problems.
So here we are looking at
the general problem of data mining.
And in general we would be
dealing with a lot of data
about our world that
are related to a problem.
And in general it will be dealing with
both non-text data and text data.
And of course the non-text data
are usually produced by physical sensors.
And those non-text data can
be also of different formats.
Numerical data, categorical,
or relational data,
or multi-media data like video or speech.
So, these non text data are often
very important in some problems.
But text data is also very important,
mostly because they contain
a lot of semantic content.
And they often contain
knowledge about the users,
especially preferences and
opinions of users.
So, by treating text data as
the data observed from human sensors,
we can treat all this data
together in the same framework.
So the data mining problem is
basically to turn such data,
turn all the data, into actionable
knowledge so that we can take advantage
of it to change the real
world, of course for the better.
So this means the data mining problem is
basically taking a lot of data as input
and giving actionable knowledge as output.
Inside of the data mining module,
you can also see
we have a number of different
kinds of mining algorithms.
And this is because, for
different kinds of data,
we generally need different algorithms for
mining the data.
For example,
video data might require computer
vision to understand video content.
And that would facilitate
the more effective mining.
And we also have a lot of general
algorithms that are applicable
to all kinds of data and those algorithms,
of course, are very useful.
Although, for a particular kind of data,
we generally want to also
develop a special algorithm.
So this course will cover
specialized algorithms that
are particularly useful for
mining text data.
[MUSIC]

[SOUND].
This lecture is about the syntagmatic
relation discovery, and entropy.
In this lecture, we're going to continue
talking about word association mining.
In particular, we're going to talk about
how to discover syntagmatic relations.
And we're going to start with
the introduction of entropy,
which is the basis for designing some
measures for discovering such relations.
By definition,
syntagmatic relations hold between words
that have correlated co-occurrences.
That means,
when we see one word occurs in context,
we tend to see the occurrence
of the other word.
So, take a more specific example, here.
We can ask the question,
whenever eats occurs,
what other words also tend to occur?
Looking at the sentences on the left,
we see some words that might occur
together with eats, like cat,
dog, or fish, right?
But if I take them out and
if you look at the right side where we
only show eats and some other words,
the question then is.
Can you predict what other words
occur to the left or to the right?
Right so
this would force us to think about what
other words are associated with eats.
If they are associated with eats,
they tend to occur in the context of eats.
More specifically our
prediction problem is to take
any text segment which can be a sentence,
a paragraph, or a document.
And then ask the question,
is a particular word present or
absent in this segment?
Right here we ask about the word W.
Is W present or absent in this segment?
Now what's interesting is that
some words are actually easier
to predict than other words.
If you take a look at the three
words shown here, meat, the, and
unicorn, which one do you
think is easier to predict?
Now if you think about it for
a moment you might conclude that
the is easier to predict because
it tends to occur everywhere.
So I can just say,
well, the would be in the sentence.
Unicorn is also relatively easy
because unicorn is rare, is very rare.
And I can bet that it doesn't
occur in this sentence.
But meat is somewhere in
between in terms of frequency.
And it makes it harder to predict because
it's possible that it occurs in a sentence
or the segment, more accurately.
But it may also not occur in the sentence,
so
now let's study this
problem more formally.
So the problem can be formally defined
as predicting the value of
a binary random variable.
Here we denote it by X sub w,
w denotes a word, so
this random variable is associated
with precisely one word.
When the value of the variable is 1,
it means this word is present.
When it's 0, it means the word is absent.
And naturally, the probabilities for
1 and 0 should sum to 1,
because a word is either present or
absent in a segment.
There's no other choice.
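As a rough illustration of this setup, the sketch below estimates the probability that X sub w equals 1 as the fraction of segments containing the word. The toy segments and the helper name `presence_probability` are assumptions made for this example, not something given in the lecture.

```python
def presence_probability(word, segments):
    """Estimate P(X_w = 1): the fraction of segments that contain `word`."""
    if not segments:
        raise ValueError("need at least one segment")
    hits = sum(1 for seg in segments if word in seg.lower().split())
    return hits / len(segments)


# Hypothetical toy data: each string is one text segment (here, a sentence).
segments = [
    "the cat eats fish",
    "the dog eats meat",
    "my dog likes the park",
    "she reads the book",
]

for w in ["the", "meat", "unicorn"]:
    p1 = presence_probability(w, segments)
    # P(X_w = 0) is simply 1 - P(X_w = 1), since the word is present or absent.
    print(w, "P(X=1) =", p1, "P(X=0) =", 1 - p1)
```

On this toy data the is present in every segment, unicorn in none, and meat in some, which mirrors the intuition about which words are easy to predict.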
So the intuition we discussed earlier
can be formally stated as follows.
The more random this random variable is,
the more difficult the prediction will be.
Now the question is how does one
quantitatively measure the randomness of
a random variable like X sub w?
How, in general, can we quantify
the randomness of a variable?
That's why we need a measure
called entropy.
This measure was introduced in information
theory to measure the randomness of X.
There is also some connection
with information here but
that is beyond the scope of this course.
So for
our purpose we just treat entropy function
as a function defined
on a random variable.
In this case, it is a binary random
variable, although the definition can
be easily generalized for
a random variable with multiple values.
Now the function form looks like this,
there's the sum of all the possible
values for this random variable.
Inside the sum for each value we
have a product of the probability
that the random variable equals this
value and log of this probability.
And note that there is also
a negative sign there.
Now entropy in general is non-negative.
And that can be mathematically proved.
So if we expand this sum, we'll see that
the equation looks like the second one.
Where I explicitly plugged
in the two values, 0 and 1.
And sometimes when we have 0 log of 0,
we would generally define that as 0,
because log of 0 is undefined.
So this is the entropy function.
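Written out, the verbal description above corresponds to the following formula. Base-2 logarithms are an assumption here (so that entropy is measured in bits, which matches a fair coin having entropy 1); the spoken description does not name the base at this point.

```latex
\begin{aligned}
H(X_w) &= -\sum_{v \in \{0,1\}} P(X_w = v)\,\log_2 P(X_w = v) \\
       &= -P(X_w = 0)\,\log_2 P(X_w = 0) \;-\; P(X_w = 1)\,\log_2 P(X_w = 1),
\qquad \text{where } 0 \log_2 0 \text{ is defined as } 0 .
\end{aligned}
```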
And this function will
give a different value for
different distributions
of this random variable.
And it clearly depends on the probability
that the random variable
takes a value of 1 or 0.
We can plot this function against
the probability that the random
variable is equal to 1.
Then the function looks like this.
At the two ends,
that means when the probability of X
equals 1 is very small or very large,
then the entropy function has a low value.
When it's 0.5 in the middle
then it reaches the maximum.
Now if we plot the function
against the probability that X
is taking a value of 0 and the function
would show exactly the same curve here,
and you can imagine why.
And so that's because
the two probabilities are symmetric,
and completely symmetric.
So an interesting question you
can think about in general is for
what kind of X does entropy
reach maximum or minimum.
And we can in particular think
about some special cases.
For example, in one case,
we might have a random variable that
always takes a value of 1.
The probability is 1.
Or there's a random variable that
is equally likely taking a value of one or
zero.
So in this case the probability
that X equals 1 is 0.5.
Now which one has a higher entropy?
It's easier to look at the problem
by thinking of a simple example
using coin tossing.
So when we think about random
experiments like tossing a coin,
it gives us a random variable,
that can represent the result.
It can be head or tail.
So we can define a random variable
X sub coin, so that it's 1
when the coin shows up as head,
it's 0 when the coin shows up as tail.
So now we can compute the entropy
of this random variable.
And this entropy indicates how
difficult it is to predict the outcome
of a coin toss.
So we can think about the two cases.
One is a fair coin, it's completely fair.
The coin shows up as head or
tail equally likely.
So the two probabilities would be a half.
Right?
So both are equal to one half.
Another extreme case is
completely biased coin,
where the coin always shows up as heads.
So it's a completely biased coin.
Now let's think about
the entropies in the two cases.
And if you plug in these values you can
see the entropies would be as follows.
For a fair coin we see the entropy
reaches its maximum, that's 1.
For the completely biased coin,
we see it's 0.
And that intuitively makes a lot of sense.
Because a fair coin is
most difficult to predict.
Whereas a completely biased
coin is very easy to predict.
We can always say, well, it's a head.
Because it is a head all the time.
So they can be shown on
the curve as follows.
So the fair coin corresponds to the middle
point where it's very uncertain.
The completely biased coin
corresponds to the end
point where we have a probability
of 1.0 and the entropy is 0.
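The two coin cases can be checked numerically with a small sketch. The function name `binary_entropy` is an assumption for this example, and base-2 logarithms are used so that the fair coin comes out at 1.

```python
import math


def binary_entropy(p_one):
    """Entropy in bits of a binary random variable with P(X = 1) = p_one."""
    if not 0.0 <= p_one <= 1.0:
        raise ValueError("p_one must be a probability in [0, 1]")
    total = 0.0
    for p in (p_one, 1.0 - p_one):
        if p > 0.0:  # convention: 0 * log(0) is treated as 0
            total -= p * math.log2(p)
    return total


fair = binary_entropy(0.5)    # fair coin: heads and tails equally likely
biased = binary_entropy(1.0)  # completely biased coin: always heads
print("fair coin:", fair)
print("completely biased coin:", biased)

# The curve is symmetric: P(X = 1) = 0.9 and P(X = 1) = 0.1 give the same entropy.
print(math.isclose(binary_entropy(0.9), binary_entropy(0.1)))
```

Values of `p_one` between the two extremes give entropies between 0 and 1, which is the shape of the curve described above.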
So, now let's see how we can use
entropy for word prediction.
Let's think about our problem is
to predict whether W is present or
absent in this segment.
Again, think about the three words,
particularly think about their entropies.
Now we can assume high entropy
words are harder to predict.
And so we now have a quantitative way to
tell us which word is harder to predict.
Now if you look at the three words meat,
the, and unicorn again,
we clearly would expect meat to have
a higher entropy than the or unicorn.
In fact if you look at the entropy of the,
it's close to zero.
Because it occurs everywhere.
So it's like a completely biased coin.
Therefore the entropy is close to zero.
[MUSIC]

[SOUND] This lecture is
about the syntagmatic
relation discovery and
conditional entropy.
In this lecture,
we're going to continue the discussion
of word association mining and analysis.
We're going to talk about the conditional
entropy, which is useful for
discovering syntagmatic relations.
Earlier, we talked about
using entropy to capture
how easy it is to predict the presence or
absence of a word.
Now, we'll address
a different scenario where
we assume that we know something
about the text segment.
So now the question is, suppose we know
that eats occurred in the segment.
How would that help us
predict the presence or
absence of another word, like meat?
And in particular, we want to
know whether the presence of eats
has helped us predict
the presence of meat.
And if we frame this using entropy,
that would mean we are interested
in knowing whether knowing
the presence of eats could reduce
uncertainty about meat,
or reduce the entropy
of the random variable
corresponding to the presence or
absence of meat.
We can also ask the question,
what if we know of the absence of eats?
Would that also help us predict
the presence or absence of meat?
These questions can be
addressed by using another
concept called conditional entropy.
So to explain this concept, let's first
look at the scenario we had before,
when we know nothing about the segment.
So we have these probabilities indicating
whether a word like meat occurs,
or it doesn't occur in the segment.
And we have an entropy function that
looks like what you see on the slide.
Now suppose we know eats is present, so
now we know the value of another
random variable that denotes eats.
Now, that would change all
these probabilities to
conditional probabilities.
Where we look at the presence or
absence of meat,
given that we know eats
occurred in the context.
So as a result,
if we replace these probabilities
with their corresponding conditional
probabilities in the entropy function,
we'll get the conditional entropy.
So this equation now here would be
the conditional entropy.
Conditional on the presence of eats.
So, you can see this is essentially
the same entropy function as you have
seen before, except that all
the probabilities now have a condition.
And this then tells us
the entropy of meat,
after we have known eats
occurring in the segment.
And of course, we can also define
this conditional entropy for
the scenario where we don't see eats.
So if we know it did not occur in
the segment, then this conditional
entropy would capture the uncertainty
of meat in that condition.
So now,
putting different scenarios together,
we have the complete definition
of conditional entropy as follows.
Basically, we're going to consider both
scenarios of the value of eats, zero and one,
and this gives us a probability
that eats is equal to zero or one.
Basically, whether eats is present or
absent.
And this of course,
is the conditional entropy of
meat in that particular scenario.
So if you expanded this entropy,
then you have the following equation.
Where you see the involvement of
those conditional probabilities.
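The slide is not reproduced in this transcript, so the following is a reconstruction of the definition from the verbal description, using the same notation as for entropy. Base-2 logarithms are assumed.

```latex
\begin{aligned}
H(X_{\text{meat}} \mid X_{\text{eats}})
  &= \sum_{u \in \{0,1\}} P(X_{\text{eats}} = u)\,
     H(X_{\text{meat}} \mid X_{\text{eats}} = u) \\
  &= \sum_{u \in \{0,1\}} P(X_{\text{eats}} = u)
     \left[ -\sum_{v \in \{0,1\}}
       P(X_{\text{meat}} = v \mid X_{\text{eats}} = u)\,
       \log_2 P(X_{\text{meat}} = v \mid X_{\text{eats}} = u) \right]
\end{aligned}
```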
Now in general, for any discrete
random variables x and y, we have
the conditional entropy is no larger
than the entropy of the variable x.
So basically, this is an upper bound for
the conditional entropy.
That means by knowing more
information about the segment,
we won't be able to
increase uncertainty.
We can only reduce uncertainty.
And that intuitively makes sense
because as we know more information,
it should always help
us make the prediction.
And cannot hurt
the prediction in any case.
Now, what's interesting here is also to
think about what's the minimum possible
value of this conditional entropy?
Now, we know that the maximum
value is the entropy of X.
But what about the minimum,
so what do you think?
I hope you can reach the conclusion that
the minimum possible value, would be zero.
And it will be interesting to think about
under what situation we will achieve this.
So, let's see how we can use conditional
entropy to capture syntagmatic relation.
Now of course,
this conditional entropy gives us directly
one way to measure
the association of two words.
Because it tells us to what extent,
we can predict the one
word given that we know the presence or
absence of another word.
Now before we look at the intuition
of conditional entropy in capturing
syntagmatic relations, it's useful to
think of a very special case, listed here.
That is, the conditional entropy
of the word given itself.
So here,
we listed this conditional
entropy in the middle.
So, it's here.
So, what is the value of this?
Now, this means we know whether
meat occurs in the sentence.
And we hope to predict whether
meat occurs in the sentence.
And of course, this is 0 because
there's no uncertainty anymore.
Once we know whether the word
occurs in the segment,
we'll already know the answer
of the prediction.
So this is zero.
And that's also when this conditional
entropy reaches the minimum.
So now, let's look at some other cases.
So this is a case of knowing "the" and
trying to predict "meat".
And this is a case of knowing "eats" and
trying to predict "meat".
Which one do you think is smaller?
Note that smaller entropy means easier
prediction.
Which one do you think is higher?
Which one is not smaller?
Well, if you look at the uncertainty,
then in the first case,
"the" doesn't really tell
us much about "meat".
So knowing the occurrence of "the" doesn't
really help us reduce entropy that much.
So it stays fairly close to
the original entropy of meat.
Whereas in the case of eats,
eats is related to meat.
So knowing the presence of eats or
the absence of eats
would help us predict whether meat occurs.
So it can help us reduce entropy of meat.
So we should expect the second term, namely
this one, to have a smaller entropy.
And that means there is a stronger
association between meat and eats.
So we now also know when
this w is the same as
meat, then the conditional entropy
would reach its minimum, which is 0.
And for what kind of words
would it reach its maximum?
Well, that's when this word
is not really related to meat.
Like "the", for example:
it would be very close to the maximum,
which is the entropy of meat itself.
So this suggests that when you
use conditional entropy for
mining syntagmatic relations,
the algorithm would look as follows.
For each word W1, we're going to
enumerate all the other words W2.
And then, we can compute
the conditional entropy of W1 given W2.
We sort all the candidate words in
ascending order of the conditional entropy
because we want to favor
a word that has a small entropy.
Meaning that it helps us predict
the target word W1.
And then, we're going to take the top-ranked
candidate words as words that have
potential syntagmatic relations with W1.
Note that we need to use
a threshold to find these words.
The threshold can be the number
of top candidates to take, or
an absolute value for
the conditional entropy.
Now, this would allow us to mine the most
strongly correlated words with
a particular word, W1 here.
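The ranking procedure just described can be sketched as follows. This is only an illustration under stated assumptions: the four toy segments are invented, each segment is treated as a set of words, the probabilities are plain relative frequencies with no smoothing, and the function names are hypothetical.

```python
import math
from collections import Counter

def cond_entropy(target, given, segments):
    """H(X_target | X_given) in bits, from presence/absence counts."""
    n = len(segments)
    counts = Counter((target in s, given in s) for s in segments)
    h = 0.0
    for y in (True, False):
        n_y = counts[(True, y)] + counts[(False, y)]
        for x in (True, False):
            n_xy = counts[(x, y)]
            if n_xy > 0:
                # p(x, y) * log2 p(x | y), estimated by relative frequency
                h -= (n_xy / n) * math.log2(n_xy / n_y)
    return h

def rank_candidates(w1, segments, top_k=3):
    """Sort all other words W2 by ascending H(W1 | W2); keep the top_k."""
    vocab = set().union(*segments) - {w1}
    scored = [(cond_entropy(w1, w2, segments), w2) for w2 in vocab]
    return sorted(scored)[:top_k]

# Invented toy data: each segment is the set of words it contains.
segments = [
    {"the", "cat", "eats", "meat"},
    {"the", "dog", "eats", "meat"},
    {"the", "cat", "sleeps"},
    {"the", "dog", "runs"},
]

for score, word in rank_candidates("meat", segments):
    print(f"H(meat | {word}) = {score:.3f}")
```

Here the threshold is the number of top candidates (`top_k`); an absolute cutoff on the conditional entropy would be the other option mentioned above.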
But, this algorithm does not
help us mine the strongest
K syntagmatic relations
from an entire collection.
Because in order to do that, we have to
ensure that these conditional entropies
are comparable across different words.
In this case of discovering
the syntagmatic relations for
a targeted word like W1, we only need
to compare the conditional entropies
for W1, given different words.
And in this case, they are comparable.
All right.
So, the conditional entropy of W1, given
W2, and the conditional entropy of W1,
given W3 are comparable.
They all measure how hard
it is to predict W1.
But if we think about two pairs
that share W2 as the condition,
where we try to predict W1 and W3,
then the conditional entropies
are actually not comparable.
You can think about this question.
Why?
So why are they not comparable?
Well, that is because they
have different upper bounds.
Right?
So those upper bounds are precisely
the entropy of W1 and the entropy of W3.
And they have different upper bounds.
So we cannot really
compare them in this way.
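Written out, the two ranges look like this. The notation X_{W} for the presence or absence of word W is an assumption made here for illustration; the slide's own notation is not reproduced in the transcript.

```latex
0 \le H(X_{W_1} \mid X_{W_2}) \le H(X_{W_1}),
\qquad
0 \le H(X_{W_3} \mid X_{W_2}) \le H(X_{W_3})
```

Because H(X_{W_1}) and H(X_{W_3}) are generally different, the same conditional entropy value can mean a large reduction for one target word and almost none for the other.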
So how do we address this problem?
Well later, we'll discuss, we can use
mutual information to solve this problem.
[MUSIC]

[SOUND].
This lecture is about the syntagmatic
relation discovery and mutual information.
In this lecture we are going to continue
discussing syntagmatic relation discovery.
In particular,
we are going to talk about another
concept in information theory,
called mutual information, and
how it can be used to discover
syntagmatic relations.
Before, we talked about a problem
with conditional entropy:
the conditional entropy
computed on different pairs of words
is not really comparable, so
that makes it harder to discover
strong syntagmatic relations
globally from a corpus.
So now we are going to introduce mutual
information, which is another concept
in information theory
that allows us to, in some sense,
normalize the conditional entropy to make
it more comparable across different pairs.
In particular, mutual information,
denoted by I(X;Y),
measures the entropy reduction
of X obtained from knowing Y.
More specifically, the question we
are interested in here is how much
reduction in the entropy of X can
we obtain by knowing Y.
So mathematically it can be
defined as the difference between
the original entropy of X and
the conditional entropy of X given Y.
And as you can see here,
it can also be defined
as the reduction of entropy of
Y because of knowing X.
Now normally the two conditional
entropies, H of X given Y and
H of Y given X, are not equal,
but interestingly,
the reduction of entropy by knowing
one of them is actually equal.
So, this quantity is called mutual
information, denoted by I here.
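In symbols, the definition described above is the standard one:

```latex
I(X;Y) \;=\; H(X) - H(X \mid Y) \;=\; H(Y) - H(Y \mid X)
```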
And this function has some interesting
properties. First, it is non-negative.
This is easy to understand because
the original entropy is never
going to be lower than the possibly
reduced conditional entropy.
In other words, the conditional entropy
will never exceed the original entropy.
Knowing some information can
always help us potentially, but
will not hurt us in predicting x.
The second property is that it
is symmetric: while conditional
entropy is not symmetric,
mutual information is. And
the third property is that it
reaches its minimum, zero, if and
only if the two random variables
are completely independent.
That means knowing one of them does not
tell us anything about the other. And
this last property can be verified by
simply looking at the equation above:
it reaches 0 if and
only if the conditional entropy of X
given Y is exactly the same
as the original entropy of X.
So that means knowing Y did not
help at all, and that is when X and
Y are completely independent.
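These three properties can be checked numerically. The sketch below uses two invented joint distributions over binary variables; the conditional entropies are obtained through the chain rule H(X|Y) = H(X,Y) - H(Y), and the function names are hypothetical.

```python
import math

def H(probs):
    """Entropy in bits of a discrete distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mi_both_ways(joint):
    """joint[(x, y)] = P(X=x, Y=y) for x, y in {0, 1}.

    Returns (H(X) - H(X|Y), H(Y) - H(Y|X)).
    """
    px = [joint[(x, 0)] + joint[(x, 1)] for x in (0, 1)]
    py = [joint[(0, y)] + joint[(1, y)] for y in (0, 1)]
    h_joint = H(joint.values())
    h_x_given_y = h_joint - H(py)  # chain rule
    h_y_given_x = h_joint - H(px)
    return H(px) - h_x_given_y, H(py) - h_y_given_x

# Invented example where X and Y are related.
dependent = {(1, 1): 0.2, (1, 0): 0.3, (0, 1): 0.1, (0, 0): 0.4}

# Invented example where the joint is exactly the product of the marginals.
independent = {(x, y): px * py
               for x, px in ((0, 0.5), (1, 0.5))
               for y, py in ((0, 0.7), (1, 0.3))}

print("dependent:  ", mi_both_ways(dependent))
print("independent:", mi_both_ways(independent))
```

For the dependent case the two numbers agree and are positive, even though H(X|Y) and H(Y|X) themselves differ; for the independent case both are zero up to floating-point rounding.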
Now when we fix X, ranking different
Ys using conditional entropy
would give the same order as
ranking based on mutual information
(a smaller conditional entropy means
a larger mutual information),
because in the function here,
H(X) is fixed because X is fixed.
So ranking based on mutual information is
exactly the same as ranking based on
the conditional entropy of X given Y, but
the mutual information allows us to
compare different pairs of X and Y.
So, that is why mutual information is
more general and in general, more useful.
So, let us examine the intuition
of using mutual information for
syntagmatic relation mining.
Now, the question we ask for syntagmatic
relation mining is,
whenever "eats" occurs,
what other words also tend to occur?
So this question can be framed as
a mutual information question, that is,
which words have high mutual
information with "eats"?
So we compute the mutual information
between "eats" and other words.
And if we do that, and it is basically
based on the same intuition as conditional
entropy, we will see that words that
are strongly associated with "eats"
will have high mutual information.
Whereas words that are not related
will have lower mutual information.
For this, I will give some example here.
The mutual information between "eats" and
"meat",
which is the same as between "meat" and
"eats" because mutual information is
symmetric, is expected to be higher than
the mutual information between "eats" and
"the", because knowing "the" does not
really help us predict "eats".
Similarly,
knowing "eats" does not help us predict
"the" either.
And you also can easily
see that the mutual
information between a word and
itself is the largest,
which is equal to
the entropy of this word.
This is because in this case the reduction is
maximum, because knowing one allows
us to predict the other completely.
So the conditional entropy is zero,
therefore the mutual information
reaches its maximum.
It is going to be larger than or equal to
the mutual information between "eats" and any other word.
In other words, picking any other word and
computing the mutual information between "eats" and
that word,
you will not get any value larger than
the mutual information between "eats" and itself.
So now let us look at how to
compute the mutual information.
Now in order to do that, we often
use a different form of mutual
information, and we can mathematically
rewrite the mutual information
into the form shown on this slide.
Where we essentially see
a formula that computes what is
called a KL-divergence (Kullback-Leibler divergence).
This is another term
in information theory.
It measures the divergence
between two distributions.
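The slide itself is not reproduced in this transcript. The standard KL-divergence form of mutual information for two binary word variables, which is presumably what the slide shows, is given below; the notation X_{w} and the base-2 logarithm are assumptions made here.

```latex
I(X_{w_1}; X_{w_2})
  = \sum_{u \in \{0,1\}} \sum_{v \in \{0,1\}}
    p(X_{w_1}=u,\, X_{w_2}=v)\,
    \log_2 \frac{p(X_{w_1}=u,\, X_{w_2}=v)}
                {p(X_{w_1}=u)\; p(X_{w_2}=v)}
```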
Now, if you look at the formula,
it is also a sum over many combinations of
different values of the two random
variables, but inside the sum,
mainly we are doing a comparison
between two joint distributions.
The numerator has the
actual observed joint distribution
of the two random variables.
The bottom part, or the denominator, can be
interpreted as the expected joint
distribution of the two random variables
if they were independent, because when
two random variables are independent,
their joint distribution is equal to
the product of the two probabilities.
So this comparison will tell us whether
the two variables are indeed independent.
If they are indeed independent then we
would expect that the two are the same,
but if the numerator is different
from the denominator, that would mean
the two variables are not independent and
that helps measure the association.
The sum is simply to take into
consideration all of the combinations
of the values of these
two random variables.
In our case, each random variable
can choose one of the two values,
zero or one, so
we have four combinations here.
If we look at this form of mutual
information, it shows that the mutual
information measures the divergence
of the actual joint distribution
from the expected distribution
under the independence assumption.
The larger this divergence is, the higher
the mutual information would be.
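As a sketch of this form of the computation, the function below sums over the four combinations and compares the observed joint probability with the product of the marginals. The example distributions are invented and the function name is hypothetical.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits, where joint[(x, y)] = P(X=x, Y=y), x, y in {0, 1}."""
    px = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1)}
    py = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}
    mi = 0.0
    for (x, y), p_xy in joint.items():
        if p_xy > 0:
            # observed joint vs. the joint expected under independence
            mi += p_xy * math.log2(p_xy / (px[x] * py[y]))
    return mi

# Invented example: the joint equals the product of the marginals.
independent = {(1, 1): 0.15, (1, 0): 0.35, (0, 1): 0.15, (0, 0): 0.35}

# Invented example: a word paired with itself, occurring in 30% of segments.
with_itself = {(1, 1): 0.3, (0, 0): 0.7, (1, 0): 0.0, (0, 1): 0.0}

print("independent pair:", mutual_information(independent))
print("word with itself:", mutual_information(with_itself))
```

The independent pair gives a value of zero up to rounding, and the word paired with itself gives its own entropy, the largest value it can reach.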
So now let us further look at what
exactly are the probabilities
involved in this formula
of mutual information.
And here, these are all the probabilities
involved, and it is easy for
you to verify that.
Basically, we have first the
marginal probabilities
corresponding to the presence or
absence of each word.
So, for w1,
we have two probabilities shown here.
They should sum to one, because a word
can either be present or absent
in the segment. And similarly for
the second word, we also have two
probabilities representing presence or
absence of this word, and
they sum to one as well.
And finally, we have a number of
joint probabilities that represent
the scenarios of co-occurrences of
the two words, and they are shown here.
And they sum to one because the two
words can only have these four
possible scenarios.
Either they both occur, so
in that case both variables will have
a value of one, or one of them occurs.
There are two scenarios.
In these two cases one of the random
variables will be equal to one and
the other will be zero and finally we have
the scenario when none of them occurs.
This is when both variables
take a value of zero.
So these are the probabilities involved
in the calculation of mutual information,
over here.
Once we know how to calculate
these probabilities,
we can easily calculate
the mutual information.
It is also interesting to note that
there are actually some relations or
constraints among these probabilities,
and we already saw two of them, right?
So in the previous slide,
you have seen that
the marginal probabilities of these
words sum to one, and
we also have seen this constraint
that says the two words have these
four scenarios of co-occurrence,
but we also have some additional
constraints listed at the bottom.
For example, this one means if we add up
the probabilities that we observe
the two words occur together and
the probabilities when the first word
occurs and the second word does not occur.
We get exactly the probability
that the first word is observed.
In other words,
when the first word is observed,
there are only two scenarios, depending on
whether the second word is also observed.
So, this probability captures the first
scenario when the second word
actually is also observed, and
this captures the second scenario
when the second word is not observed.
So, we only see the first word, and
it is easy to see the other equations
also follow the same reasoning.
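The slide is not reproduced here, but the constraints described can be written out as follows. The notation X_{w} is an assumption, and the exact set shown on the slide may differ.

```latex
\begin{aligned}
p(X_{w_1}=1) + p(X_{w_1}=0) &= 1 \\
p(X_{w_2}=1) + p(X_{w_2}=0) &= 1 \\
\sum_{u,v \in \{0,1\}} p(X_{w_1}=u,\, X_{w_2}=v) &= 1 \\
p(X_{w_1}=1, X_{w_2}=1) + p(X_{w_1}=1, X_{w_2}=0) &= p(X_{w_1}=1) \\
p(X_{w_1}=0, X_{w_2}=1) + p(X_{w_1}=0, X_{w_2}=0) &= p(X_{w_1}=0) \\
p(X_{w_1}=1, X_{w_2}=1) + p(X_{w_1}=0, X_{w_2}=1) &= p(X_{w_2}=1) \\
p(X_{w_1}=1, X_{w_2}=0) + p(X_{w_1}=0, X_{w_2}=0) &= p(X_{w_2}=0)
\end{aligned}
```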
Now these equations allow us to
compute some probabilities based on
other probabilities, and
this can simplify the computation.
So more specifically,
if we know the probability that
a word is present, like in this case,
so if we know this, and
if we know the probability of
the presence of the second word,
then we can easily compute
the absence probability, right?
It is very easy to use this
equation to do that, and so
we take care of the computation of
these probabilities of presence and
absence of each word.
Now let's look at
the joint distribution.
Let us assume that we also have available
the probability that
they occurred together.
Now it is easy to see that we can
actually compute all the rest of these
probabilities based on these.
Specifically for
example using this equation we can compute
the probability that the first word
occurred and the second word did not,
because we know these probabilities in
the boxes, and similarly using this
equation we can compute the probability
that we observe only the second word.
And then finally,
this probability can be calculated
by using this equation because
now this is known, and
this is also known, and
this is already known, right.
So this can easily be calculated.
So this slide shows that we only
need to know how to compute
these three probabilities
that are shown in the boxes,
namely the presence of each word and the
co-occurrence of both words in a segment.
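A small sketch of this bookkeeping: given only the three probabilities in the boxes, the remaining ones follow from the constraints. The function name and the example numbers are hypothetical.

```python
def fill_joint(p_w1, p_w2, p_both):
    """Derive the full joint distribution from three known probabilities.

    p_w1   = P(X_w1 = 1), presence of the first word
    p_w2   = P(X_w2 = 1), presence of the second word
    p_both = P(X_w1 = 1, X_w2 = 1), co-occurrence of both words
    """
    p_only_w1 = p_w1 - p_both  # first word occurs, second does not
    p_only_w2 = p_w2 - p_both  # second word occurs, first does not
    # The four joint probabilities must sum to one.
    p_neither = 1.0 - p_both - p_only_w1 - p_only_w2
    return {(1, 1): p_both, (1, 0): p_only_w1,
            (0, 1): p_only_w2, (0, 0): p_neither}

# Invented numbers for illustration.
joint = fill_joint(p_w1=0.5, p_w2=0.4, p_both=0.3)
for combo, p in joint.items():
    print(combo, round(p, 3))
```

The absence probabilities are simply 1 - p_w1 and 1 - p_w2, so nothing beyond these three inputs is needed.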
[MUSIC]

[SOUND]
In general, we can use the empirical count
of events in the observed data
to estimate the probabilities.
And a commonly used technique is
called the maximum likelihood estimate,
where we simply normalize
the observed counts.
So if we do that, we can see, we can
compute these probabilities as follows.
For estimating the probability that
we see a word occurring in a segment,
we simply normalize the count of
segments that contain this word.
So let's first take
a look at the data here.
On the right side, you see a list of some
hypothesized data.
These are segments.
And in some segments you see both words
occur, they are indicated as ones for
both columns.
In some other cases only one will occur,
so only that column has one and
the other column has zero.
And of course, in some other
cases none of the words occur,
so they are both zeros.
And for estimating these probabilities, we
simply need to collect the three counts.
So the three counts are first,
the count of W1.
And that's the total number of
segments that contain word W1.
It's just the ones in the column of W1.
We can count how many
ones we have seen there.
The second count is for word 2, and we
just count the ones in the second column.
And this will give us the total
number of segments that contain W2.
The third count is when both words occur.
So this time, we're going to count
the segments where both columns have ones.
And then, so this would give us
the total number of segments
where we have seen both W1 and W2.
Once we have these counts,
we can just normalize these counts by N,
which is the total number of segments, and
this will give us the probabilities that
we need to compute mutual information.
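The counting just described can be sketched like this. The table of segments is invented, with one (W1, W2) row per segment and 1 meaning the word is present; the function name is hypothetical.

```python
def estimate_probabilities(rows):
    """Maximum likelihood estimates from a presence/absence table.

    rows: list of (x1, x2) pairs, one per segment, where 1 means present.
    Returns (p(W1 present), p(W2 present), p(both present)).
    """
    n = len(rows)
    count_w1 = sum(x1 for x1, _ in rows)  # ones in the W1 column
    count_w2 = sum(x2 for _, x2 in rows)  # ones in the W2 column
    count_both = sum(1 for x1, x2 in rows if x1 == 1 and x2 == 1)
    return count_w1 / n, count_w2 / n, count_both / n

# Invented data: five segments.
rows = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1)]
p_w1, p_w2, p_both = estimate_probabilities(rows)
print(p_w1, p_w2, p_both)
```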
Now, there is a small problem
when we have zero counts sometimes.
And in this case, we don't want a zero
probability because our data may be
a small sample and in general, we would
believe that it's potentially possible for
a word to occur in any context.
So, to address this problem,
we can use a technique called smoothing.
And that's basically to add some
small constant to these counts,
so that we don't get
a zero probability in any case.
Now, the best way to understand smoothing
is to imagine that we actually observed more
data than we actually have, because we'll
pretend we observed some pseudo-segments.
They are illustrated at the top,
on the right side of the slide.
And these pseudo-segments would
contribute additional counts
of these words so
that no event will have zero probability.
Now, in particular we introduce
four pseudo-segments.
Each is weighted at one quarter.
And these represent the four different
combinations of occurrences of these words.
So now each event,
each combination will have
at least a non-zero
count from these pseudo-segments.
So, in the actual segments
that we observe,
it's okay if we haven't observed
all of the combinations.
So more specifically, you can see
the 0.5 here comes from the two
ones in the two pseudo-segments,
because each is weighted at one quarter.
We add them up, we get 0.5.
And similar to this,
0.25 comes from one single
pseudo-segment that indicates
the two words occur together.
And of course in the denominator we add
the total number of pseudo-segments that
we add. In this case,
we added four pseudo-segments.
Each is weighted at one quarter, so
the total sum is one.
So, that's why in the denominator
you'll see a one there.
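Putting the smoothing constants together, a minimal sketch of the smoothed estimates looks like this. The function name and the example counts are hypothetical; the constants 0.5, 0.25 and 1 are the ones explained above.

```python
def smoothed_probabilities(count_w1, count_w2, count_both, n):
    """Smoothed estimates using four pseudo-segments, each weighted 1/4.

    Two pseudo-segments contain each word (adds 0.5 to each word count),
    one contains both words (adds 0.25 to the co-occurrence count),
    and the four weights add 1 to the total number of segments.
    """
    p_w1 = (count_w1 + 0.5) / (n + 1)
    p_w2 = (count_w2 + 0.5) / (n + 1)
    p_both = (count_both + 0.25) / (n + 1)
    return p_w1, p_w2, p_both

# Invented counts: the two words never co-occur in the 9 observed segments.
p_w1, p_w2, p_both = smoothed_probabilities(count_w1=3, count_w2=2,
                                            count_both=0, n=9)
print(p_w1, p_w2, p_both)
```

Even with a co-occurrence count of zero, the estimated joint probability stays positive, so the logarithm in the mutual information formula is always defined.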
So, this basically concludes
the discussion of how to compute mutual
information for syntagmatic relation discovery.
Now, so to summarize,
syntagmatic relations can generally
be discovered by measuring correlations
between occurrences of two words.
We've introduced three
concepts from information theory.
Entropy, which measures the uncertainty
of a random variable X.
Conditional entropy, which measures
the entropy of X given we know Y.
And mutual information of X and Y,
which measures the entropy reduction of X
due to knowing Y, or
entropy reduction of Y due to knowing X.
They are the same.
So these three concepts are actually very
useful for other applications as well.
That's why we spent some time
to explain this in detail.
But in particular,
they are also very useful for
discovering syntagmatic relations.
In particular,
mutual information is a principled way for
discovering such relations.
It allows us to have values
computed on different pairs of
words that are comparable, and
so we can rank these pairs and
discover the strongest syntagmatic
relations from a collection of documents.
Now, note that there is some relation
between syntagmatic relation discovery and
paradigmatic relation discovery.
So we already discussed the possibility
of using BM25 to achieve weighting for
terms in the context to potentially
also suggest the candidates
that have syntagmatic relations
with the candidate word.
But here, once we use mutual information
to discover syntagmatic relations,
we can also represent the context with
this mutual information as weights.
So this would give us
another way to represent
the context of a word, like "cat".
And if we do the same for all the words,
then we can cluster these words or
compare the similarity between these
words based on their context similarity.
So this provides yet
another way to do term weighting for
paradigmatic relation discovery.
And so to summarize this whole part
about word association mining:
we introduced two basic associations,
called paradigmatic and
syntagmatic relations.
These are fairly general, they apply
to any items in any language, so
the units don't have to be words,
they can be phrases or entities.
We introduced multiple statistical
approaches for discovering them,
mainly showing that pure
statistical approaches are
viable for
discovering both kinds of relations.
And they can be combined to
perform joint analysis, as well.
These approaches can be applied
to any text with no human effort,
mostly because they are based
on counting of words, yet
they can actually discover
interesting relations of words.
We can also use different ways of
defining context and segment, and
this would lead us to some interesting
variations of applications.
For example, the context can be very
narrow, like a few words around a word, or
a sentence, or maybe paragraphs,
and using different contexts would
allow us to discover different flavors
of paradigmatic relations.
And similarly,
when counting co-occurrences using, let's say,
mutual information to discover
syntagmatic relations,
we also have to define the segment, and
the segment can be defined as a narrow
text window or a longer text article.
And this would give us different
kinds of associations.
These discovered associations can
support many other applications,
in both information retrieval and
text and data mining.
So here are some recommended readings,
if you want to know more about the topic.
The first is a book with
a chapter on collocations,
which is quite relevant to
the topic of these lectures.
The second is an article
about using various
statistical measures to
discover lexical atoms.
Those are phrases that
are non-compositional.
For example,
hot dog is not really a dog that's hot,
blue chip is not a chip that's blue.
And the paper has a discussion about some
techniques for discovering such phrases.
The third one is a new paper on a unified
way to discover both paradigmatic
relations and syntagmatic relations,
using random walks on word graphs.
[SOUND]

[SOUND]
So,
looking at the text mining problem more
closely, we see that the problem is
similar to general data mining, except
that we'll be focusing more on text data.
And we're going to have text mining
algorithms to help us to turn text data
into actionable knowledge that
we can use in the real world,
especially for decision making, or
for completing whatever tasks that
require text data to support.
Because, in general,
in many real world problems of data mining
we also tend to have other kinds
of data that are non-textual.
So a more general picture would be
to include non-text data as well.
And for this reason we might be
concerned with joint mining of text and
non-text data.
And so in this course we're
going to focus more on text mining,
but we're also going to touch on how to
do joint analysis of both text data and
non-text data.
With this problem definition we
can now look at the landscape of
the topics in text mining and analytics.
Now this slide shows the process of
generating text data in more detail.
More specifically, a human sensor or
human observer would look at
the world from some perspective.
Different people would be looking at
the world from different angles and
they'll pay attention to different things.
The same person at different times might
also pay attention to different aspects
of the observed world.
And so the humans are able to perceive
the world from some perspective.
And that human, the sensor,
would then form a view of the world.
And that can be called the Observed World.
Of course, this would be different from
the Real World because the perspective
that the person has taken
can often be biased.
Now the Observed World can be
represented as, for example,
entity-relation graphs or
in a more general way,
using knowledge representation language.
But in general, this is basically what
a person has in mind about the world.
And we don't really know what
exactly it looks like, of course.
But then the human would
express what the person has
observed using a natural language,
such as English.
And the result is text data.
Of course a person could have used
a different language to express what he or
she has observed.
In that case we might have text data of
mixed languages or different languages.
The main goal of text mining
is actually to reverse this
process of generating text data.
We hope to be able to uncover
some aspect in this process.
Specifically, we can think about mining,
for example, knowledge about the language.
And that means by looking at text data
in English, we may be able to discover
something about English, some usage
of English, some patterns of English.
So this is one type of mining problems,
where the result is
some knowledge about language which
may be useful in various ways.
If you look at the picture,
we can also then mine knowledge
about the observed world.
And so this has much to do with
mining the content of text data.
We're going to look at what the text
data are about, and then try to
get the essence of it or
extract high quality information
about a particular aspect of
the world that we're interested in.
For example, everything that has been
said about a particular person or
a particular entity.
And this can be regarded as mining content
to describe the observed world in
the user's mind or the person's mind.
If you look further,
then you can also imagine
we can mine knowledge about this observer,
himself or herself.
So this has also to do with
using text data to infer
some properties of this person.
And these properties could
include the mood of the person or
sentiment of the person.
And note that we distinguish
the observed world from the person
because text data can describe what the
person has observed in an objective way.
But the description can also be
subjective, with sentiment, and so,
in general, you can imagine the text
data would contain some factual
descriptions of the world plus
some subjective comments.
So that's why it's also possible to
do text mining to mine
knowledge about the observer.
Finally, if you look at
the left side of this picture,
then you can see we can certainly also
say something about the real world.
Right?
So indeed we can do text mining to
infer other real world variables.
And this is often called
predictive analytics.
And we want to predict the value
of certain interesting variables.
So, this picture basically covered
multiple types of knowledge that
we can mine from text in general.
When we infer other
real world variables we
could also use some of the results from
mining text data as intermediate
results to help the prediction.
For example,
after we mine the content of text data we
might generate some summary of content.
And that summary could be then used
to help us predict the variables
of the real world.
Now of course this is still generated
from the original text data,
but I want to emphasize here that
often the processing of text data
to generate some features that can help
with the prediction is very important.
And that's why here we show the results of
some other mining tasks, including
mining the content of text data and
mining knowledge about the observer,
can all be very helpful for prediction.
In fact, when we have non-text data,
we could also use the non-text
data to help prediction, and
of course it depends on the problem.
In general, non-text data can be very
important for such prediction tasks.
For example,
if you want to predict stock prices or
changes of stock prices based on
discussion in the news articles or
in social media, then this is an example
of using text data to predict
some other real world variables.
But in this case, obviously,
the historical stock price data would
be very important for this prediction.
And so that's an example of
non-text data that would be very
useful for the prediction.
And we're going to combine both kinds
of data to make the prediction.
Now non-text data can be also used for
analyzing text by supplying context.
When we look at the text data alone,
we'll be mostly looking at the content
and/or opinions expressed in the text.
But text data generally also
has context associated.
For example, the time and the location
that are associated with the text data.
And these are useful context information.
And the context can provide interesting
angles for analyzing text data.
For example, we might partition text
data into different time periods
because of the availability of the time.
Now we can analyze text data in each
time period and then make a comparison.
Similarly we can partition text
data based on locations or
any meta data that's associated to
form interesting comparisons in areas.
So, in this sense,
non-text data can actually provide
interesting angles or
perspectives for text data analysis.
And it can help us make context-sensitive
analysis of content or
the language usage or
the opinions about the observer or
the authors of text data.
We could analyze the sentiment
in different contexts.
So this is a fairly general landscape of
the topics in text mining and analytics.
In this course we're going to
selectively cover some of those topics.
We actually hope to cover
most of these general topics.
First we're going to cover
natural language processing very
briefly because this has to do
with understanding text data and
this determines how we can represent
text data for text mining.
Second, we're going to talk about how to
mine word associations from text data.
And word associations are a form of useful
lexical knowledge about a language.
Third, we're going to talk about
topic mining and analysis.
And this is only one way to
analyze content of text, but
it's a very useful way
of analyzing content.
It's also one of the most useful
techniques in text mining.
Then we're going to talk about
opinion mining and sentiment analysis.
So this can be regarded as one example
of mining knowledge about the observer.
And finally we're going to
cover text-based prediction
problems where we try to predict some
real world variable based on text data.
So this slide also serves as
a road map for this course.
And we're going to use
this as an outline for
the topics that we'll cover
in the rest of this course.
[MUSIC]

[SOUND]
This lecture is about natural language
content analysis.
Natural language content analysis
is the foundation of text mining.
So we're going to first talk about this.
And in particular,
natural language processing is
a factor in how we can represent text data.
And this determines what algorithms can
be used to analyze and mine text data.
We're going to take a look at the basic
concepts in natural language first.
And I'm going to explain these concepts
using a simple example
that you're all seeing here.
A dog is chasing a boy on the playground.
Now this is a very simple sentence.
When we read such a sentence
we don't have to think
about it to get the meaning of it.
But when a computer has to
understand the sentence,
the computer has to go
through several steps.
First, the computer needs
to know what are the words,
how to segment the words in English.
And this is very easy,
we can just look at the space.
And then the computer will need
to know the categories of these words,
syntactical categories.
So for example, dog is a noun,
chasing is a verb, boy is another noun, etc.
And this is called lexical analysis.
In particular, tagging these words
with these syntactic categories
is called part-of-speech tagging.
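As a toy illustration of these first two steps, the sketch below splits the example sentence on spaces and looks each word up in a tiny hand-written lexicon. The lexicon and its tag names are assumptions made for this example (only noun and verb are named in the lecture); a real part-of-speech tagger has to resolve ambiguity and is far more involved.

```python
sentence = "A dog is chasing a boy on the playground"

# Step 1: word segmentation. In English we can just look at the spaces.
tokens = sentence.split()

# Step 2: part-of-speech tagging with a hypothetical hand-written lexicon.
TOY_LEXICON = {
    "a": "Det", "the": "Det",
    "dog": "N", "boy": "N", "playground": "N",
    "is": "Aux", "chasing": "V",
    "on": "P",
}

# Words missing from the toy lexicon are marked as unknown.
tagged = [(tok, TOY_LEXICON.get(tok.lower(), "UNK")) for tok in tokens]

for tok, tag in tagged:
    print(f"{tok}/{tag}")
```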
After that the computer also needs to
figure out the relationship between
these words.
So a and dog would form a noun phrase.
On the playground would be
a prepositional phrase, etc.
And there are certain ways for
them to be connected together in order for
them to create meaning.
Some other combinations
may not make sense.
And this is called syntactical parsing, or
syntactical analysis,
parsing of a natural language sentence.
The outcome is a parse tree
that you are seeing here.
That tells us the structure
of the sentence, so
that we know how we can
interpret this sentence.
But this is not semantics yet.
So in order to get the meaning we
would have to map these phrases and
these structures into some real world
entities that we have in our mind.
So dog is a concept that we know,
and boy is a concept that we know.
So connecting these phrases
to what we know is understanding.
Now a computer would have to formally
represent these entities by using symbols.
So Dog(d1) means d1 is a dog.
Boy(b1) means b1 refers to a boy, etc.
And it also represents the chasing
action as a predicate.
So, chasing is a predicate here with
three arguments, d1, b1, and p1,
which is the playground.
So this is a formal rendition of
the semantics of this sentence.
Once we reach that level of understanding,
we might also make inferences.
For example, if we assume there's a rule
that says if someone's being chased then
the person can get scared, then we
can infer this boy might be scared.
This is the inferred meaning,
based on additional knowledge.
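To make this concrete, here is a small Python sketch of the idea.
The symbols d1, b1 and p1 follow the lecture; the tuple encoding,
the function name infer_scared and the predicate name MaybeScared
are assumptions made only for this illustration.

```python
# Facts as tuples: (predicate, arguments...). Symbols d1, b1, p1 follow the lecture.
facts = {
    ("Dog", "d1"),
    ("Boy", "b1"),
    ("Playground", "p1"),
    ("Chasing", "d1", "b1", "p1"),
}

def infer_scared(facts):
    """Apply the rule: if someone is being chased, they might be scared."""
    derived = set()
    for fact in facts:
        if fact[0] == "Chasing":
            _, chaser, chased, place = fact
            derived.add(("MaybeScared", chased))
    return derived

inferred = infer_scared(facts)
print(inferred)
```

The derived fact is not stated anywhere in the sentence. It only
follows once the extra rule is supplied as additional knowledge.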
And finally, we might even further infer
what this sentence is requesting,
or why the person who said the
sentence is saying it.
And so, this has to do with the
purpose of saying the sentence.
This is called speech act analysis or
pragmatic analysis,
which refers to the use of language.
So, in this case a person saying this
may be reminding another person to
bring back the dog.
So this means when saying a sentence,
the person actually takes an action.
So the action here is to make a request.
Now, this slide clearly shows that
in order to really understand
a sentence there are a lot of
things that a computer has to do.
Now, in general it's very hard for
a computer to do everything,
especially if you want
it to do everything correctly.
This is very difficult.
Now, the main reason why natural
language processing is very difficult
is that natural language is designed to
make human communication efficient.
As a result, for example,
we omit a lot of common sense knowledge.
Because we assume all of
us have this knowledge,
there's no need to encode this knowledge.
That makes communication efficient.
We also keep a lot of ambiguities,
like, ambiguities of words.
And this is again, because we assume we
have the ability to disambiguate the word.
So, there's no problem with
having the same word to mean
possibly different things
in different context.
Yet for
a computer this would be very difficult
because a computer does not have
the common sense knowledge that we do.
So the computer will be confused indeed.
And this makes it hard for
natural language processing.
Indeed, it makes it very hard for
every step in the slide
that I showed you earlier.
Ambiguity is a main killer.
Meaning that in every step
there are multiple choices,
and the computer would have to
decide what's the right choice, and
that decision can be very difficult,
as you will also see in a moment.
And in general,
we need common sense reasoning in order
to fully understand the natural language.
And computers today don't yet have that.
That's why it's very hard for
computers to precisely understand
the natural language at this point.
So here are some specific
examples of challenges.
Think about word-level ambiguity.
A word like design can be a noun or
a verb, so
we've got an ambiguous part-of-speech tag.
Root also has multiple meanings;
it can have a mathematical sense,
as in the square root of, or
it can be the root of a plant.
Syntactic ambiguity refers
to different interpretations
of a sentence in terms of structures.
So for example,
natural language processing can
actually be interpreted in two ways.
So one is the ordinary meaning that we
will be getting as we're
talking about this topic.
So, it's processing of natural language.
But there is also another
possible interpretation,
which is to say language
processing is natural.
Now we don't generally have this problem,
but imagine the computer trying to determine
the structure; the computer would have
to make a choice between the two.
Another classic example is: a man
saw a boy with a telescope.
The ambiguity lies in
the question of who had the telescope.
This is called a prepositional
phrase attachment ambiguity,
meaning where to attach the
prepositional phrase, with the telescope.
Should it modify the boy?
Or should it modify saw, the verb?
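One way to see the two readings side by side is to write both
structures down. The Python sketch below encodes each parse as
nested tuples whose first element is a phrase label. The labels
(S, NP, VP, PP) and the helper leaves are assumptions chosen for
this illustration, not the output of any particular parser.

```python
# Reading 1: the prepositional phrase modifies "a boy" (the boy had the telescope).
attach_to_noun = (
    "S",
    ("NP", "a", "man"),
    ("VP", "saw",
        ("NP", ("NP", "a", "boy"), ("PP", "with", ("NP", "a", "telescope")))),
)

# Reading 2: the prepositional phrase modifies "saw" (the man used the telescope).
attach_to_verb = (
    "S",
    ("NP", "a", "man"),
    ("VP", "saw",
        ("NP", "a", "boy"),
        ("PP", "with", ("NP", "a", "telescope"))),
)

def leaves(tree):
    """Return the words of a tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:  # tree[0] is the phrase label
        words.extend(leaves(child))
    return words

# Same words, different structures: that is the ambiguity.
print(leaves(attach_to_noun) == leaves(attach_to_verb))
print(attach_to_noun == attach_to_verb)
```

Both trees cover exactly the same word sequence, so nothing in the
string itself tells the computer which structure to pick.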
Another problem is anaphora resolution.
In John persuaded Bill to buy a TV for
himself,
does himself refer to John or Bill?
Presupposition is another difficulty.
He has quit smoking implies
that he smoked before, and
we need to have such knowledge in
order to understand the language.
Because of these problems, state-of-the-art
natural language processing
techniques cannot do anything perfectly.
Even for
the simplest task, part-of-speech tagging,
we still cannot solve the whole problem.
The accuracy that is listed here,
which is about 97%,
was just taken from some earlier studies.
And these studies obviously have to
be using particular data sets, so
the numbers here are not
really meaningful if you
take them out of the context of the data
sets that are used for evaluation.
But I show these numbers mainly to give
you some sense of the accuracy,
or how well we can do things like this.
It doesn't mean that on any data set
the accuracy would be precisely 97%.
But, in general, we can do part-of-speech
tagging fairly well, although not perfectly.
Parsing would be more difficult, but for
partial parsing, meaning to get some
phrases correct, we can probably
achieve 90% or better accuracy.
But to get the complete parse tree
correctly is still very, very difficult.
For semantic analysis, we can also do
some aspects of semantic analysis,
particularly extraction of entities and
relations.
For example, recognizing this is
a person, that's a location, and
this person and
that person met in some place, etc.
We can also do word sense
disambiguation to some extent.
The occurrence of root in this sentence
refers to the mathematical sense, etc.
Sentiment analysis is another aspect
of semantic analysis that we can do.
That means we can tag sentences
as generally positive or negative when
they are talking about a product or
talking about a person.
Inference, however, is very hard,
and we generally cannot do that for
any big domain; it's only
feasible for a very limited domain.
And that's a generally difficult
problem in artificial intelligence.
Speech act analysis is
also very difficult, and
we can probably only do this for
very specialized cases,
and with a lot of help from humans
to annotate enough data for
the computers to learn from.
So the slide also shows that
computers are far from being able to
understand natural language precisely.
And that also explains why the text
mining problem is difficult.
Because we cannot rely on
mechanical approaches or
computational methods to
understand the language precisely.
Therefore, we have to use
whatever we have today,
in particular statistical machine learning
methods or statistical analysis methods,
to try to get as much meaning
out of the text as possible.
And later you will see
that there are actually
many such algorithms
that can indeed extract
interesting knowledge from text even though
we cannot really fully understand the
meaning of all the natural
language sentences precisely.
[MUSIC]

[SOUND]
So here are some specific examples of what
we can't do today.
Part-of-speech tagging is still
not easy to do 100% correctly.
For example, in he turned off the
highway versus he turned off the fan,
the two offs actually have somewhat
different syntactic
categories. It's also very difficult
to get a complete parse correct.
Again, the example, a man saw a boy
with a telescope can actually
be very difficult to parse
depending on the context.
Precise deep semantic
analysis is also very hard.
For example, to define the meaning of own
precisely is very difficult in
a sentence like John owns a restaurant.
So the state of the art can
be summarized as follows.
Robust and
general NLP tends to be shallow while
a deep understanding does not scale up.
For this reason in this course,
the techniques that we cover are in
general, shallow techniques for
analyzing text data and
mining text data and they are generally
based on statistical analysis.
So they are robust and
general, and they are in
the category of shallow analysis.
Such techniques have
the advantage of being able to be
applied to any text data in
any natural language about any topic.
But the downside is that they don't
give us a deeper understanding of text.
For that, we have to rely on
deeper natural language analysis.
That typically would require
human effort to annotate
a lot of examples of the analysis that we
would like to do, and then computers can use
machine learning techniques to learn from
these training examples to do the task.
So in practical applications, we generally
combine the two kinds of techniques,
with the general statistical
methods as a backbone, as the basis.
These can be applied to any text data.
And on top of that, we're going to use
humans to annotate more data and
use supervised machine learning
to do some tasks as well as we can,
especially for those important
tasks, to bring humans into the loop
to analyze text data more precisely.
But this course will cover
the general statistical approaches
that generally
don't require much human effort.
So they're practically
more useful than some of the deeper
analysis techniques that require a lot of
human effort to annotate the text data.
So to summarize,
the main points to take away are: first, NLP
is the foundation for text mining.
So obviously, the better we
can understand the text data,
the better we can do text mining.
Computers today are far from being able
to understand the natural language.
Deep NLP requires common sense
knowledge and inferences.
Thus, it only works for
very limited domains and is not feasible for
large-scale text mining.
Shallow NLP based on statistical
methods can be done at large scale and
is the main topic of this course;
these methods are generally applicable
to a lot of applications.
They are, in some sense, also
the more useful techniques.
In practice,
we use statistical NLP as the basis and
we'll have humans
help as needed in various ways.
[MUSIC]

[SOUND] This lecture is
about Text Representation.
In this lecture we're going to discuss
text representation and discuss how
natural language processing can allow us
to represent text in many different ways.
Let's take a look at this
example sentence again.
We can represent this sentence
in many different ways.
First, we can always represent such
a sentence as a string of characters.
This is true for all languages
when we store them in the computer.
When we store a natural language
sentence as a string of characters,
we have perhaps the most general
way of representing text, since
we can always use this approach
to represent any text data.
But unfortunately, using such
a representation will not help us do
semantic analysis, which is often needed
for many applications of text mining.
The reason is that we're
not even recognizing words.
So as a string, we are going to keep all
of the spaces and these ASCII symbols.
We can perhaps count what's
the most frequent character in
English text, or
the correlation between those characters.
But we can't really analyze semantics. Yet
this is the most general way of
representing text because we
can use this to represent
any natural language text.
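As a quick sketch of what is possible at this level, the Python
snippet below counts characters in one sentence. The sentence is
assumed to be the running example; collections.Counter is from the
standard library.

```python
from collections import Counter

# The text is treated purely as a string: spaces are characters like any other.
text = "A dog is chasing a boy on the playground"
char_counts = Counter(text)

# Most frequent characters; nothing here knows what a word is.
print(char_counts.most_common(3))
```

In this sentence the most frequent character is the space, which
shows how little such counts say about meaning.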
If we try to do a little bit more
natural language processing by
doing word segmentation,
then we can obtain a representation
of the same text, but
in the form of a sequence of words.
So here we see that we can identify words,
like a dog is chasing, etc.
Now with this level of representation
we suddenly can do a lot of things.
And this is mainly because words are the
basic units of human communication and
natural language.
So they are very powerful.
By identifying words, we can, for
example, easily count what
are the most frequent words in this
document or in the whole collection, etc.
And these words can be
used to form topics
when we combine related words together.
And some words are positive and
some words are negative, so
we can also do sentiment analysis.
So representing text data as a sequence
of words opens up a lot of interesting
analysis possibilities.
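Here is a minimal Python sketch of word counting over a document
and over a collection. The two-document collection is made up for
illustration, and splitting on spaces is the simple English
segmentation described earlier.

```python
from collections import Counter

# A tiny made-up collection; the second document is invented for illustration.
documents = [
    "a dog is chasing a boy on the playground",
    "the boy is on the playground",
]

# Word segmentation by spaces turns each string into a sequence of words.
word_sequences = [doc.split() for doc in documents]

doc_counts = Counter(word_sequences[0])                      # one document
collection_counts = Counter(w for seq in word_sequences for w in seq)  # whole collection

print(doc_counts.most_common(1))
print(collection_counts.most_common(1))
```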
However, this level of representation
is slightly less general than a string of
characters,
because in some languages, such as
Chinese, it's actually not that easy to
identify all the word boundaries,
because in such a language you see
text as a sequence of characters
with no space in between.
So you have to rely on some special
techniques to identify words.
In such a language, of course, we
might make mistakes in segmenting words.
So the sequence of words representation
is not as robust as a string of characters.
But in English, it's very easy to
obtain this level of representation.
So we can do that all the time.
Now if we go further to do more natural
language processing, we can add
part-of-speech tags.
Now once we do that we can count, for
example, the most frequent nouns or
what kind of nouns are associated
with what kind of verbs, etc.
So, this opens up a little bit
more interesting opportunities for
further analysis.
Note that I use a plus sign here because
by representing text as a sequence
of part-of-speech tags,
we don't necessarily replace
the original word sequence.
Instead, we add this as an additional
way of representing text data.
So now the data is represented
as both a sequence of words and
a sequence of part-of-speech tags.
This enriches the representation
of text data, and
thus also enables more
interesting analysis.
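The plus sign can be pictured as keeping two parallel sequences.
In the Python sketch below the tags are assigned by hand for the
example sentence and the tag names are an assumption; a real
system would get them from a part-of-speech tagger.

```python
from collections import Counter

words = ["a", "dog", "is", "chasing", "a", "boy", "on", "the", "playground"]
# Hand-assigned tags for illustration only.
tags = ["Det", "Noun", "Aux", "Verb", "Det", "Noun", "Prep", "Det", "Noun"]

# The tag sequence is added next to the word sequence, not in place of it.
representation = {"words": words, "pos_tags": tags}

# With both layers available we can count, for example, the nouns.
noun_counts = Counter(w for w, t in zip(words, tags) if t == "Noun")
print(noun_counts)
```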
If we go further,
then we'll be parsing the sentence
to obtain a syntactic structure.
Now this of course will
further open up more
interesting analysis of, for example,
the writing styles or
correcting grammar mistakes.
If we go further to semantic analysis,
then we might be able to
recognize dog as an animal.
And we can also recognize boy as a person,
and playground as a location.
And we can further
analyze their relations.
For example, dog was chasing the boy,
and boy is on the playground.
This will add more entities and relations,
through entity-relation recognition.
At this level,
we can do even more interesting things.
For example, now we can easily
count the most frequent person
that's mentioned in this whole
collection of news articles.
Or whenever you mention this person,
you also tend to see mentions
of another person, etc.
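Assuming an entity recognizer has already labeled the mentions,
these counts are simple to compute. In the Python sketch below,
the articles, names and type labels are all made up; the sketch
only shows the counting step, not the recognition itself.

```python
from collections import Counter
from itertools import combinations

# Hypothetical recognizer output: one list of (mention, type) pairs per article.
articles = [
    [("Alice", "PERSON"), ("Bob", "PERSON"), ("Chicago", "LOCATION")],
    [("Alice", "PERSON"), ("Carol", "PERSON")],
    [("Alice", "PERSON"), ("Bob", "PERSON")],
]

person_counts = Counter()   # in how many articles each person is mentioned
co_mentions = Counter()     # in how many articles two people appear together
for entities in articles:
    people = sorted({name for name, kind in entities if kind == "PERSON"})
    person_counts.update(people)
    co_mentions.update(combinations(people, 2))

print(person_counts.most_common(1))
print(co_mentions.most_common(1))
```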
So this is a very useful representation.
And it's also related to the knowledge
graph that some of you may have heard of
that Google is doing as a more semantic
way of representing text data.
However, it's also less
robust than a sequence of words,
or even syntactic analysis,
because it's not always easy
to identify all the entities with the
right types and we might make mistakes.
And relations are even harder to find and
we might make mistakes.
This makes this level of representation
less robust, yet it's very useful.
Now if we move further to logic
representation, then we have predicates and
inference rules.
With inference rules we can infer
interesting derived facts from the text.
So that's very useful but
unfortunately, this level of
representation is even less robust and
we can make mistakes.
And we can't do that all the time for
all kinds of sentences.
And finally, speech acts would add yet
another level of representation of
the intent of saying this sentence.
So in this case it might be a request.
So knowing that would allow us to
analyze even more interesting
things about the observer or
the author of this sentence.
What's the intention of saying that?
What scenarios or
what kind of actions will be made?
So this is another level of analysis
that would be very interesting.
So this picture shows that if
we move down, we generally see
more sophisticated natural language
processing techniques being used.
And unfortunately such techniques
would require more human effort.
And they are less accurate.
That means there are mistakes.
So if we analyze our text at
the levels that are representing
deeper analysis of language then
we have to tolerate errors.
So that also means it's still necessary
to combine such deep analysis
with shallow analysis based on,
for example, sequence of words.
On the right side, you see the arrow
points down to indicate that
as we go down, our representation of text
is closer to the knowledge representation
in our mind that we need for
solving a lot of problems.
Now, this is desirable because if we can
represent text at the level of knowledge,
we can easily extract the knowledge.
That's the purpose of text mining.
So, there is a trade-off here
between doing deeper analysis
that might have errors but
would give us direct knowledge
that can be extracted from text,
and doing shallow analysis,
which is more robust but
wouldn't actually give us the necessary
deeper representation of knowledge.
I should also say that text
data are generated by humans,
and are meant to be consumed by humans.
So as a result, in text data analysis,
text mining,
humans play a very important role.
They are always in the loop,
meaning that we should optimize
a collaboration of humans and computers.
So, in that sense it's okay that
computers may not be able to
have completely accurate
representation of text data.
And patterns that are extracted from
text data can be interpreted by humans.
And then humans can guide the computers to
do more accurate analysis by annotating
more data, by providing features to
guide machine learning programs,
to make them work more effectively.
[MUSIC]

[SOUND].
So, as we explained, different text
representations tend to
enable different analyses.
In particular,
we can gradually add more and
more deeper analysis results
to represent text data.
And that would open up more
interesting representation
opportunities and
also analysis capacities.
So, this table summarizes
what we have just seen.
So the first column shows
the text representation.
The second visualizes the generality
of such a representation.
Meaning whether we can do this
kind of representation accurately for
all the text data or only some of them.
And the third column shows
the enabled analysis techniques.
And the final column shows some
examples of application that
can be achieved through this
level of representation.
So let's take a look at them.
So as a string, text can only be processed
by string processing algorithms.
It's very robust, it's general.
And there are still some interesting
applications that can be done
at this level.
For example, compression of text
doesn't necessarily need to
know the word boundaries,
although knowing word boundaries
might actually also help.
Word-based representation is a very
important level of representation.
It's quite general and
relatively robust, enabling
a lot of analysis techniques,
such as word relation analysis,
topic analysis and sentiment analysis.
And there are many applications that can
be enabled by this kind of analysis.
For example, thesaurus discovery has
to do with discovering related words.
And topic and
opinion related applications abound.
For example, people
might be interested in knowing the major
topics covered in a collection of texts.
And this can be the case
in research literature,
where scientists want to know what are the
most important research topics today.
Or customer service people might want to
know what are the major complaints from their
customers by mining their e-mail messages.
And business intelligence
people might be interested in
understanding consumers' opinions about
their products and the competitors'
products to figure out what are the
winning features of their products.
And, in general, there are many
applications that can be enabled by
the representation at this level.
Now, moving down, we'll see we can
gradually add additional representations.
By adding syntactic structures,
we can enable, of course,
syntactic graph analysis.
We can use graph mining algorithms
to analyze syntactic graphs.
And some applications are related
to this kind of representation.
For example,
stylistic analysis generally requires
syntactical structure representation.
We can also generate
structure-based features.
And those are features that might help us
classify text objects into different
categories. By looking at the structures,
sometimes the classification
can be more accurate.
For example,
if you want to classify articles into
different categories corresponding
to different authors.
You want to figure out which of
the k authors has actually written
this article, then you generally need
to look at the syntactic structures.
When we add entities and relations,
then we can enable other techniques
such as knowledge graph
analysis, or information network
analysis in general.
And this analysis enables
applications about entities.
For example,
discovery of all the knowledge and
opinions about real-world entities.
You can also use this level of representation
to integrate everything about
anything from scattered sources.
Finally, when we add logical predicates,
that would enable logic inference,
of course.
And this can be very useful for
integrative analysis of
scattered knowledge.
For example,
we can also add an ontology on top of the
information extracted from text
to make inferences.
A good example of an application
enabled by this level of representation
is a knowledge assistant for biologists.
This is a program that can help a biologist
manage all the relevant knowledge from
literature about a research problem such
as understanding functions of genes.
And the computer can make inferences
about some of the hypotheses that
the biologist might be interested in.
For example,
whether a gene has a certain function, and
then the intelligent program can read the
literature to extract the relevant facts
by doing information extraction,
and then use a logic system to
actually track down the answers
to researchers' questions about what
genes are related to what functions.
So in order to support
this level of application
we need to go as far as
logical representation.
Now, this course covers techniques
mainly based on word-based representation.
These techniques are general and
robust and thus more widely
used in various applications.
In fact, in virtually all text mining
applications you need this level of
representation and the techniques that
support analysis of text at this level.
But obviously all these other
levels can be combined and
should be combined in order to support
more sophisticated applications.
So to summarize,
here are the major takeaway points.
Text representation determines what
kind of mining algorithms can be applied.
And there are multiple ways to
represent the text, strings, words,
syntactic structures, entity-relation
graphs, knowledge predicates, etc.
And these different
representations should in general
be combined in real applications
to the extent we can.
For example, even if we cannot
do accurate representations
of syntactic structures, we can
still extract partial structures.
And if we can recognize some entities,
that would be great.
So in general we want to
do as much as we can.
And when different levels
are combined together,
we can enable a richer analysis,
more powerful analysis.
This course however focuses
on word-based representation.
Such techniques also have several
advantages. First, they are general and
robust, so they are applicable
to any natural language.
That's a big advantage over
other approaches that rely on
more fragile natural language
processing techniques.
Secondly, they do not require
much manual effort, or
sometimes they do not
require any manual effort.
So that's, again, an important benefit,
because that means that you can apply
them directly to any application.
Third, these techniques are actually
surprisingly powerful and
effective for many applications,
although not all, of course,
as I just explained.
Now they are very effective
partly because words
were invented by humans as basic
units for communication.
So they are actually quite sufficient for
representing all kinds of semantics.
So that makes this kind of word-based
representation also powerful.
And finally, such a word-based
representation and the techniques enabled
by such a representation can be combined
with many other sophisticated approaches.
So they're not competing with each other.
[MUSIC]

[SOUND] This lecture is
about word association
mining and analysis.
In this lecture,
we're going to talk about how to mine
associations of words from text.
Now this is an example of knowledge
about the natural language that
we can mine from text data.
Here's the outline.
We're going to first talk about
what is word association and
then explain why discovering such
relations is useful and finally
we're going to talk about some general
ideas about how to mine word associations.
In general there are two basic
word relations.
One is called a paradigmatic relation.
The other is a syntagmatic relation.
A and B have a paradigmatic relation
if they can be substituted for each other.
That means the two words that
have a paradigmatic relation
would be in the same semantic class,
or syntactic class.
And we can in general
replace one by the other
without affecting
the understanding of the sentence.
That means we would still
have a valid sentence.
For example, cat and dog, these two
words have a paradigmatic relation
because they are in
the same class of animal.
And in general,
if you replace cat with dog in a sentence,
the sentence would still be a valid
sentence that you can make sense of.
Similarly, Monday and
Tuesday have a paradigmatic relation.
The second kind of relation is
called a syntagmatic relation.
In this case, the two words that have this
relation can be combined with each other.
So A and B have a syntagmatic relation if
they can be combined with each other in
a sentence; that means these two
words are semantically related.
So for example, cat and sit are related
because a cat can sit somewhere.
Similarly, car and
drive are related semantically and
they can be combined with
each other to convey meaning.
However, in general, we can not
replace cat with sit in a sentence or
car with drive in the sentence
to still get a valid sentence,
meaning that if we do that, the sentence
will become somewhat meaningless.
So this is different from
paradigmatic relation.
And these two relations are in fact so
fundamental that they can be
generalized to capture basic relations
between units in arbitrary sequences.
And definitely they can be
generalized to describe
relations of any items in a language.
So, A and B don't have to be words;
they can be phrases, for example.
And they can even be more complex
phrases than just a noun phrase.
If you think about the general
problem of sequence mining,
then we can think about the units
being any elements of the sequence data.
Then we think of paradigmatic
relations as relations that
apply to units that tend to occur
in similar locations in a sentence,
or in a sequence of data
elements in general.
So they occur in similar locations
relative to the neighbors in the sequence.
Syntagmatic relation, on
the other hand, is related to
co-occurring elements that tend
to show up in the same sequence.
So these two are complementary and
are basic relations of words.
And we're interested in discovering
them automatically from text data.
Discovering such word
relations has many applications.
First, such relations can be directly
useful for improving accuracy of many NLP
tasks, and this is because they are part
of our knowledge about a language.
So if you know two words
are synonyms, for example,
then that can help with a lot of tasks.
Grammar learning can also be
done by using such techniques.
Because if we can learn
paradigmatic relations,
then we form classes of words,
syntactic classes for example.
And if we learn syntagmatic relations,
then we would be able to know
the rules for putting together a larger
expression based on component expressions.
So we learn the structure and
what can go with what else.
Word relations can be also very useful for
many applications in text retrieval and
mining.
For example, in search and
text retrieval, we can use word
associations to modify a query,
and this can be used to
introduce additional related words into
a query and make the query more effective.
It's often called query expansion.
Or you can use related words to
suggest related queries to the user
to explore the information space.
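A minimal Python sketch of query expansion is shown below. The
association table ASSOCIATIONS, the function expand_query and the
per_term limit are all hypothetical; in practice the related words
would be mined from text rather than written by hand.

```python
# Hypothetical association table; in practice it would be mined from text.
ASSOCIATIONS = {
    "car": ["drive", "vehicle"],
    "cat": ["dog", "pet"],
}

def expand_query(query, associations, per_term=2):
    """Append up to `per_term` associated words for each query term."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for related in associations.get(term, [])[:per_term]:
            if related not in expanded:
                expanded.append(related)
    return expanded

expanded = expand_query("car rental", ASSOCIATIONS)
print(expanded)
```

The original terms are kept first so that the added words only
supplement the user's query.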
Another application is to
use word associations to
automatically construct a topic
map for browsing.
We can have words as nodes and
associations as edges.
A user could navigate from
one word to another to
find information in the information space.
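Such a map can be sketched as a small undirected graph in Python.
The edge list below is made up for illustration, and the helper
neighbors is a hypothetical name for one navigation step.

```python
from collections import defaultdict

# Hypothetical mined associations, given as pairs of words.
edges = [("cat", "dog"), ("cat", "sit"), ("car", "drive"), ("dog", "eats")]

# Words are nodes; each association is an undirected edge.
graph = defaultdict(set)
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)

def neighbors(word):
    """Words a user could navigate to from `word` in one step."""
    return sorted(graph.get(word, set()))

print(neighbors("cat"))
```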
Finally, such word associations can also
be used to compare and summarize opinions.
For example, we might be interested
in understanding positive and
negative opinions about the iPhone 6.
In order to do that, we can look at what
words are most strongly associated with
a feature word like battery in
positive versus negative reviews.
Such syntagmatic
relations would help us
show the detailed opinions
about the product.
So, how can we discover such
associations automatically?
Now, here are some intuitions
about how to do that.
Now let's first look at
the paradigmatic relation.
Here we essentially can take
advantage of similar context.
So here you see some simple
sentences about cat and dog.
You can see they generally
occur in similar contexts,
and that after all is the definition
of paradigmatic relation.
On the right side you can
see I explicitly extracted
the context of cat and
dog from this small sample of text data.
I've taken away cat and
dog from these sentences, so
that you can see just the context.
Now, of course we can have different
perspectives to look at the context.
For example, we can look at
what words occur in the left
part of this context.
So we can call this left context.
What words occur before we see cat or dog?
So, you can see in this case, clearly
dog and cat have similar left contexts.
You generally say his cat or my cat, and
you also say my dog and his dog.
So that makes them similar
in the left context.
Similarly, if you look at the words
that occur after cat and dog,
which we can call right context,
they are also very similar in this case.
Of course, it's an extreme case,
where you only see eats.
And in general,
you'll see many other words, of course,
that can follow cat and dog.
You can also even look
at the general context.
And that might include all
the words in the sentence or
in sentences around this word.
And even in the general context, you also
see similarity between the two words.
So this is just a suggestion
that we can discover paradigmatic
relations by looking at
the similarity of the contexts of words.
So, for example,
we can think about the following questions.
How similar are the context of cat and
the context of dog?
In contrast, how similar are the context
of cat and the context of computer?
Now, intuitively,
we would imagine the context of cat and
the context of dog would
be more similar than
the context of cat and
the context of computer.
That means, in the first case
the similarity value would be high
between the context of cat and
dog, whereas in the second,
the similarity between the context of cat
and computer would be low,
because they do not have a paradigmatic
relationship. Imagine what words
occur after computer in general.
They would be very different from
what words occur after cat.
So this is the basic idea of
discovering paradigmatic relations.
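This idea can be sketched in a few lines of Python. The sample
sentences are made up in the spirit of the slide, and Jaccard
overlap between sets of neighboring words is just one simple
similarity choice assumed here, not necessarily the measure used
later in the course.

```python
def contexts(word, sentences, side="left"):
    """Collect the words seen immediately to the left or right of `word`."""
    found = set()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != word:
                continue
            if side == "left" and i > 0:
                found.add(tokens[i - 1])
            if side == "right" and i + 1 < len(tokens):
                found.add(tokens[i + 1])
    return found

def jaccard(a, b):
    """Set overlap in [0, 1]; one simple similarity choice among many."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Made-up sample text for illustration.
sentences = [
    "my cat eats fish on saturday",
    "his cat eats turkey on tuesday",
    "my dog eats meat on sunday",
    "his dog eats turkey on tuesday",
    "my computer runs programs on monday",
]

cat_right = contexts("cat", sentences, "right")
dog_right = contexts("dog", sentences, "right")
computer_right = contexts("computer", sentences, "right")

sim_cat_dog = jaccard(cat_right, dog_right)
sim_cat_computer = jaccard(cat_right, computer_right)
print(sim_cat_dog, sim_cat_computer)
```

On this toy sample the right contexts of cat and dog coincide,
while cat and computer share nothing on the right, which matches
the intuition above.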
What about the syntagmatic relation?
Well, here we're going to explore
the correlated occurrences,
again based on the definition
of syntagmatic relation.
Here you see the same sample of text.
But here we're interested in knowing
what other words are correlated
with the verb eats and
what words can go with eats.
And if you look at the right
side of this slide,
you see
I've taken away the two words around eats.
I've taken away the word to its left and
also the word to its
right in each sentence.
And then we ask the question, what words
tend to occur to the left of eats?
And what words tend to
occur to the right of eats?
Now thinking about this question
would help us discover syntagmatic
relations because syntagmatic relations
essentially capture such correlations.
So the important question to ask for
syntagmatic relation is,
whenever eats occurs,
what other words also tend to occur?
So the question here has
to do with whether there
are some other words that tend
to co-occur together with eats.
Meaning that whenever you see eats
you tend to see the other words.
And if you don't see eats, probably,
you don't see other words often either.
So this intuition can help
discover syntagmatic relations.
Now again, consider an example.
How helpful is occurrence of eats for
predicting occurrence of meat?
Right.
All right, so knowing whether eats occurs
in a sentence would generally help us
predict whether meat also occurs indeed.
And if we see eats occur in the sentence,
that should increase the chance
that meat would also occur.
In contrast,
if you look at the question in the bottom,
how helpful is the occurrence of eats for
predicting the occurrence of text?
Because eats and
text are not really related,
knowing whether eats occurred
in the sentence doesn't
really help us predict whether
text also occurs in the sentence.
So this is in contrast to
the question about eats and meat.
This also helps explain the intuition
behind the methods for
discovering syntagmatic relations.
Mainly we need to capture the correlation
between the occurrences of two words.
So to summarize the general ideas for
discovering word associations
are the following.
For paradigmatic relation,
we represent each word by its context.
And then compute the context similarity.
We're going to assume the words
that have high context similarity
to have paradigmatic relation.
For syntagmatic relation, we will count
how many times two words occur together
in a context, which can be a sentence,
a paragraph, or a document even.
And we're going to compare
their co-occurrences with
their individual occurrences.
We're going to assume words
with high co-occurrences but
relatively low individual occurrences
to have syntagmatic relations
because they tend to occur together and
they don't usually occur alone.
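As a rough illustration of the counting idea just summarized,
here is a minimal Python sketch. The toy sentences, the function
name cooccurrence_ratio, and the Jaccard-style ratio are all
assumptions made for this example; they are not the measures
that the course develops later.

```python
from collections import Counter
from itertools import combinations

# Toy sentences (hypothetical data, not taken from the lecture slides).
sentences = [
    "my cat eats fish on saturday",
    "his cat eats turkey on tuesday",
    "my dog eats meat on sunday",
    "his dog eats turkey on tuesday",
    "the computer runs the program",
]

word_count = Counter()  # number of sentences that contain each word
pair_count = Counter()  # number of sentences that contain both words

for sentence in sentences:
    words = set(sentence.split())  # each sentence is one context
    word_count.update(words)
    pair_count.update(frozenset(p) for p in combinations(sorted(words), 2))

def cooccurrence_ratio(w1, w2):
    """Share of sentences containing w1 or w2 that contain both.

    This Jaccard-style ratio is an illustrative choice only: it is high
    when co-occurrence is high relative to the individual occurrences.
    """
    both = pair_count[frozenset((w1, w2))]
    either = word_count[w1] + word_count[w2] - both
    return both / either if either else 0.0

print(cooccurrence_ratio("eats", "turkey"))    # the two words share sentences
print(cooccurrence_ratio("eats", "computer"))  # the two words never co-occur
```

Here a sentence is used as the context; a paragraph or a whole
document could be used in the same way.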
Note that the paradigmatic relation and
the syntagmatic relation
are actually closely related
in that paradigmatically
related words tend to have syntagmatic
relation with the same word.
They tend to be associated
with the same word, and
that suggests that we can also do joint
discovery of the two relations.
So these general ideas can be
implemented in many different ways.
And the course won't cover all of them,
but
we will cover at least some of
the methods that are effective for
discovering these relations.
[MUSIC]

[SOUND]
This
lecture is about
Paradigmatic Relation Discovery.
In this lecture we are going to talk about
how to discover a particular kind of word
association called
a paradigmatical relation.
By definition,
two words are paradigmatically
related if they share a similar context.
Namely, they occur in
similar positions in text.
So naturally our idea of discovering such
a relation is to look at the context
of each word and then try to compute
the similarity of those contexts.
So here is an example of
context of a word, cat.
Here I have taken the word
cat out of the context and
you can see we are seeing some remaining
words in the sentences that contain cat.
Now, we can do the same thing for
another word like dog.
So in general we would like to capture
such a context and then try to assess
the similarity of the context of cat and
the context of a word like dog.
So now the question is how can we
formally represent the context and
then define the similarity function.
So first, we note that the context
actually contains a lot of words.
So, they can be regarded as
a pseudo document, an imaginary
document, but there are also different
ways of looking at the context.
For example, we can look at the word
that occurs before the word cat.
We can call this context Left1 context.
All right, so in this case you
will see words like my, his, or
big, a, the, et cetera.
These are the words that can
occur to the left of the word cat.
So we say my cat, his cat,
big cat, a cat, et cetera.
Similarly, we can also collect the words
that occur right after the word cat.
We can call this context Right1, and
here we see words like eats,
ate, is, has, et cetera.
Or, more generally,
we can look at all the words in
the window of text around the word cat.
Here, let's say we can take a window
of 8 words around the word cat.
We call this context Window8.
Now, of course, you can see all
the words from left or from right, and
so we'll have a bag of words in
general to represent the context.
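The three context views can be sketched in Python as follows.
This is only a sketch under stated assumptions: the function name
collect_contexts is hypothetical, the two sentences are toy data,
tokens are split on whitespace, and the window is taken to mean
up to `window` words on each side of the target word.

```python
from collections import Counter

def collect_contexts(sentences, target, window=8):
    """Return (Left1, Right1, Window) bags of words for `target`.

    Assumption: the window covers up to `window` tokens on each side.
    """
    left1, right1, win = Counter(), Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            if i > 0:                      # word right before the target
                left1[tokens[i - 1]] += 1
            if i + 1 < len(tokens):        # word right after the target
                right1[tokens[i + 1]] += 1
            start = max(0, i - window)     # surrounding words, target removed
            win.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return left1, right1, win

# Toy sentences (hypothetical, in the spirit of the lecture example).
sentences = [
    "My cat eats fish on Saturday",
    "His cat eats turkey on Tuesday",
]
left1, right1, window8 = collect_contexts(sentences, "cat")
print(left1)    # words seen to the left of cat
print(right1)   # words seen to the right of cat
print(window8)  # bag of words around cat
```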
Now, such a word based representation
would actually give us
an interesting way to define the
perspective of measuring the similarity.
Because if you look at just
the similarity of Left1,
then we'll see words that share
just the words in the left context,
and we kind of ignored the other words
that are also in the general context.
So that gives us one perspective to
measure the similarity, and similarly,
if we only use the Right1 context,
we will capture the similarity
from another perspective.
Using both the Left1 and
Right1 of course would allow us to capture
the similarity with even
more strict criteria.
So in general, context may contain
adjacent words, like eats and
my, that you see here, or
non-adjacent words, like Saturday,
Tuesday, or
some other words in the context.
And this flexibility also allows us
to match the similarity in somewhat
different ways.
Sometimes this is useful,
as we might want to capture
similarity based on general content.
That would give us loosely
related paradigmatical relations.
Whereas if you use only the words
immediately to the left and
to the right of the word, then you
likely will capture words that are very
much related by their syntactical
categories and semantics.
So the general idea of discovering
paradigmatical relations
is to compute the similarity
of context of two words.
So here, for example,
we can measure the similarity of cat and
dog based on the similarity
of their context.
In general, we can combine all
kinds of views of the context.
And so the similarity function is,
in general,
a combination of similarities
on different context.
And of course, we can also assign
weights to these different
similarities to allow us to focus
more on a particular kind of context.
And this would be naturally
application specific, but again,
here the main idea for discovering
paradigmatically related words is
to compute the similarity
of their context.
So next let's see how we exactly
compute these similarity functions.
Now to answer this question,
it is useful to think of bag of words
representation as vectors
in a vector space model.
Now those of you who are
familiar with information retrieval or
text retrieval techniques would
realize that the vector space model has
been used frequently for
modeling documents and queries for search.
But here we also find it convenient
to model the context of a word for
paradigmatic relation discovery.
So the idea of this
approach is to view each
word in our vocabulary as defining one
dimension in a high dimensional space.
So if we have N words in
total in the vocabulary,
then we have N dimensions,
as illustrated here.
And on the bottom, you can see a frequency
vector representing a context,
and here we see that eats
occurred 5 times in this context,
ate occurred 3 times, et cetera.
So this vector can then be placed
in this vector space model.
So in general,
we can represent a pseudo document or
context of cat as one vector,
d1, and another word,
dog, might give us a different context,
so d2.
And then we can measure
the similarity of these two vectors.
So by viewing context in
the vector space model,
we convert the problem of
paradigmatical relation discovery
into the problem of computing
the vectors and their similarity.
So the two questions that we
have to address are first,
how to compute each vector, and
that is how to compute xi or yi.
And the other question is how
do you compute the similarity.
Now in general, there are many approaches
that can be used to solve the problem, and
most of them are developed for
information retrieval.
And they have been shown to work well for
matching a query vector and
a document vector.
But we can adapt many of
the ideas to compute a similarity
of context documents for our purpose here.
So let's first look at
the one plausible approach,
where we try to match
the similarity of context based on
the expected overlap of words,
and we call this EOWC.
So the idea here is to represent
a context by a word vector
where each word has a weight
that's equal to the probability
that a randomly picked word from
this document vector, is this word.
So in other words,
xi is defined as the normalized
count of word wi in the context, and
this can be interpreted as
the probability that you would
actually pick this word from d1
if you randomly picked a word.
Now, of course these xi's would sum to one
because they are normalized frequencies,
and this means the vector is
actually a probability
distribution over words.
So, the vector d2 can be also
computed in the same way, and
this would give us then two probability
distributions representing two contexts.
So, that addresses the problem
how to compute the vectors, and
next let's see how we can define
similarity in this approach.
Well, here, we simply define
the similarity as a dot product of two
vectors, and
this is defined as a sum of the products
of the corresponding
elements of the two vectors.
Now, it's interesting to see
that this similarity function
actually has a nice interpretation,
and that is this.
The dot product in fact gives
us the probability that two
randomly picked words from
the two contexts are identical.
That means if we try to pick a word
from one context and try to pick another
word from another context, we can then
ask the question, are they identical?
If the two contexts are very similar,
then we should expect we frequently will
see the two words picked from
the two contexts are identical.
If they are very different,
then the chance of seeing
identical words being picked from
the two contexts would be small.
So this intuitively makes sense, right,
for measuring similarity of contexts.
Now you might want to also take
a look at the exact formulas and
see why this can be interpreted
as the probability that
two randomly picked words are identical.
So if you just stare at the formula
to check what's inside this sum,
then you will see basically in each
case it gives us the probability that
we will see an overlap on
a particular word, wi.
And where xi gives us a probability that
we will pick this particular word from d1,
and yi gives us the probability
of picking this word from d2.
And when we pick the same
word from the two contexts,
then we have an identical pick.
So that's one possible approach, EOWC,
expected overlap of words in context.
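To make EOWC concrete, here is a minimal Python sketch. The
function name eowc and the counts for dog and computer are
assumptions for illustration; only the counts 5 for eats and 3 for
ate echo the frequency vector mentioned earlier.

```python
from collections import Counter

def eowc(context1, context2):
    """Expected Overlap of Words in Context.

    Each context is a bag of word counts. The counts are normalized into
    probabilities x_i and y_i, and the score is the dot product
    sum_i x_i * y_i, i.e. the probability that two words picked at random,
    one from each context, are identical.
    """
    total1 = sum(context1.values())
    total2 = sum(context2.values())
    if total1 == 0 or total2 == 0:
        return 0.0
    return sum((c / total1) * (context2.get(w, 0) / total2)
               for w, c in context1.items())

# Hypothetical context counts.
cat = Counter({"eats": 5, "ate": 3, "is": 2})
dog = Counter({"eats": 4, "ate": 4, "barks": 2})
computer = Counter({"runs": 6, "is": 4})

print(eowc(cat, dog))       # higher: eats and ate are shared
print(eowc(cat, computer))  # lower: only is is shared
```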
Now as always, we would like to assess
whether this approach would work well.
Now of course, ultimately we have to
test the approach with real data and
see if it gives us really
semantically related words,
that is, whether it really gives us
paradigmatical relations. But
analytically we can also analyze
this formula a little bit.
So first, as I said,
it does make sense, right, because this
formula will give a higher score if there
is more overlap between the two contexts.
So that's exactly what we want.
But if you analyze
the formula more carefully,
then you also see there might
be some potential problems,
and specifically there
are two potential problems.
First, it might favor matching
one frequent term very well,
over matching more distinct terms.
And that is because in the dot product,
if one element has a high value and this
element is shared by both contexts and
it contributes a lot to the overall sum,
it might indeed make the score
higher than in another case,
where the two vectors actually have
a lot of overlap in different terms.
But each term has a relatively low
frequency, so this may not be desirable.
Of course, this might be
desirable in some other cases.
But in our case, we should intuitively
prefer a case where we match
more different terms in the context,
so that we have more confidence
in saying that the two words
indeed occur in similar context.
If you only rely on one term,
that's a little bit questionable;
it may not be robust.
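The first problem can be seen in a small numeric sketch. The two
pairs of contexts below are made up for illustration, and eowc is
the same hypothetical helper name used for the dot product of
normalized counts.

```python
def eowc(c1, c2):
    """Dot product of the two normalized count vectors."""
    n1, n2 = sum(c1.values()), sum(c2.values())
    return sum((c / n1) * (c2.get(w, 0) / n2) for w, c in c1.items())

# Pair A (hypothetical): the contexts share a single, very frequent term.
pair_a = ({"the": 9, "fish": 1}, {"the": 9, "bone": 1})

# Pair B (hypothetical): the contexts share five distinct terms,
# each with a relatively low frequency.
shared = {"eats": 2, "ate": 2, "is": 2, "has": 2, "my": 2}
pair_b = (shared, dict(shared))

score_a = eowc(*pair_a)  # about 0.9 * 0.9
score_b = eowc(*pair_b)  # about 5 * (0.2 * 0.2)
print(score_a, score_b)  # pair A scores higher despite one shared term
```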
Now the second problem is that it
treats every word equally, right.
So if you match a word like the,
it will be the same as
matching a word like eats, but
intuitively we know
matching the isn't really
surprising because the occurs everywhere.
So matching the is not such
strong evidence as matching
a word like eats,
which doesn't occur frequently.
So this is another
problem of this approach.
In the next lecture we are going to talk
about how to address these problems.
[MUSIC]

[SOUND] In this lecture
we continue discussing
Paradigmatic Relation Discovery.
Earlier we introduced a method called
Expected Overlap of Words in Context.
In this method we represent each
context by a word vector
that represents the probability
of a word in the context.
And we measure the similarity by using the
dot product which can be interpreted as
the probability that two randomly picked
words from the two contexts are identical.
We also discussed the two
problems of this method.
The first is that it favors
matching one frequent term
very well over matching
more distinct terms.
It puts too much emphasis on
matching one term very well.
The second is that it
treats every word equally.
Even a common word like
the would contribute equally
as content word like eats.
So now we are going to talk about
how to solve these problems.
More specifically we're going to
introduce some retrieval heuristics
used in text retrieval, and these
heuristics can effectively solve these
problems, as these problems also occur
in text retrieval when we match a query
with a document. So
to address the first problem,
we can use a sublinear
transformation of term frequency.
That is, we don't have to use the raw
frequency count of the term to represent
the context.
We can transform it into some form that
wouldn't emphasize so much the raw
frequency. To address the second problem,
we can put more weight on rare terms.
That is,
we can reward matching a rare word.
And this heuristic is called IDF
term weighting in text retrieval.
IDF stands for inverse document frequency.
So now we're going to talk about
the two heuristics in more detail.
First, let's talk about
the TF transformation.
That is, to convert the raw count of
a word in the document into some weight
that reflects our belief about
how important this word is in
the document.
And so,
that would be denoted by TF of w and d.
That's shown in the Y axis.
Now, in general,
there are many ways to map that.
And let's first look at
the simple way of mapping.
In this case, we're going to say, well,
any non zero counts will be mapped to one.
And the zero count will be mapped to zero.
So with this mapping, all the frequencies
will be mapped to only two values,
zero or one.
And the mapping function is
shown here as a flat line.
This is naive because it ignores
the frequency of words. However,
this actually has the
advantage of emphasizing
matching all the words in the context.
It does not allow a frequent
word to dominate the match. Now
the approach that we have taken earlier
in the expected overlap approach
is a linear transformation. We
basically take y as the same as x, so
we use the raw count as
a representation, and
that created the problem
that we just talked about.
Namely, it emphasizes too much
on matching one frequent term.
Matching one frequent term
can contribute a lot.
We can have a lot of other interesting
transformations in between
the two extremes.
And they generally form
a sublinear transformation.
So for example,
one is the logarithm of the raw count.
And this will give us a curve that looks
like this that you are seeing here.
In this case,
you can see the high frequency counts.
The high counts are penalized
a little bit, all right,
so the curve is a sublinear curve.
And it brings down the weight
of those really high counts.
And this is what we want because it prevents
that kind of terms from
dominating the scoring function.
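The three mappings discussed so far can be compared side by side
in a short Python sketch. The function names are hypothetical, and
the logarithmic form log(1 + x) is an assumption: it is a common
choice that keeps a zero count at zero, but the exact form on the
slide is not stated in the transcript.

```python
import math

def tf_binary(x):
    """0/1 mapping: any non-zero count becomes 1."""
    return 1.0 if x > 0 else 0.0

def tf_linear(x):
    """Linear mapping: y is the same as the raw count x."""
    return float(x)

def tf_log(x):
    """Sublinear mapping; log(1 + x) is an assumed form."""
    return math.log(1 + x)

for x in (0, 1, 2, 10, 100):
    print(x, tf_binary(x), tf_linear(x), round(tf_log(x), 3))
```

The logarithmic values sit between the two extremes: they keep
growing with the count, but much more slowly than the raw count.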
Now, there is also another interesting
transformation called a BM25
transformation, which has been shown
to be very effective for retrieval.
And in this transformation we
have a form that looks like this.
So it's k plus one multiplied by x,
divided by x plus k.
Where k is a parameter.
X is the count.
The raw count of a word.
Now the transformation is very
interesting, in that it can actually
kind of go from one extreme to
the other extreme by varying k,
and it also is interesting that it
has upper bound, k + 1 in this case.
So, this puts a very strict
constraint on high frequency terms,
because their weight
will never exceed k + 1.
As we vary k,
we can simulate the two extremes.
So, when k is set to zero,
we roughly have the zero one vector.
Whereas, when we set k
to a very large value,
it will behave more like
the linear transformation.
So this transformation function is by far
the most effective transformation function
for text retrieval, and it also
makes sense for our problem set up.
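The BM25 transformation described above, (k + 1) x / (x + k), can
be sketched in a few lines of Python. The function name tf_bm25
and the sample values of k are assumptions for illustration; the
zero count is handled separately so that k = 0 does not divide
by zero.

```python
def tf_bm25(x, k):
    """BM25 TF transformation: (k + 1) * x / (x + k), bounded by k + 1."""
    if x == 0:
        return 0.0  # a zero count always maps to zero
    return (k + 1) * x / (x + k)

# k = 0 reproduces the 0/1 mapping for any non-zero count.
print(tf_bm25(1, 0), tf_bm25(50, 0))

# A moderate k (1.2 is a hypothetical setting) gives a sublinear curve
# that stays below the upper bound k + 1.
print([round(tf_bm25(x, 1.2), 3) for x in (1, 2, 10, 100, 10**6)])

# A very large k behaves almost like the linear mapping y = x.
print(tf_bm25(3, 1e9))
```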
So we just talked about how to solve the
problem of overemphasizing
a frequent term.
Now let's look at the second problem, and
that is how we can penalize popular terms.
Matching the is not surprising
because the occurs everywhere.
But matching eats would count a lot, so
how can we address that problem?
In this case we can use the IDF weighting
that's commonly used in retrieval.
IDF stands for inverse document frequency.
Document frequency means the count of
the total number of documents
that contain a particular word.
So here we show that the IDF measure
is defined as a logarithm function
of the number of documents that
match a term or document frequency.
So, k is the number of documents
containing a word, or document frequency.
And M here is the total number
of documents in the collection.
The IDF function is
giving a higher value for
a lower k,
meaning that it rewards a rare term, and
the maximum value is log of M+1.
That's when the word occurred in just one
context document, so that's a very rare term,
the rarest term in the whole collection.
The lowest value you can see here is when
K reaches its maximum, which would be M.
All right so,
that would be a very low value,
close to zero in fact.
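A small Python sketch of the IDF function follows. The form
log((M + 1) / k) is inferred from the two endpoints just described
(log of M+1 when k is 1, close to zero when k is M); the function
and argument names and the collection size are assumptions.

```python
import math

def idf(doc_freq, num_docs):
    """IDF = log((M + 1) / k), with k the document frequency of the word
    and M the total number of documents in the collection."""
    return math.log((num_docs + 1) / doc_freq)

M = 1000  # hypothetical collection size
print(idf(1, M))    # rarest possible term: log(M + 1)
print(idf(100, M))  # a moderately common term
print(idf(M, M))    # a term in every document: close to zero
```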
So this measure, of course,
is used in search,
where we naturally have a collection.
In our case, what would be our collection?
Well, we can also use the contexts
that we had collected for
all the words as our collection.
And that is to say, a word that's
popular in the collection in general
would also have a low
IDF. Depending
on the dataset, we can construct
the context vectors in different ways.
But in the end, if a term is very
frequent in the original data set,
then it will still be frequent in
the collected context documents.
So how can we add these
heuristics to improve our
similarity function? Well, here's one way.
And there are many other
ways that are possible.
But this is a reasonable way.
Where we can adapt the BM25
retrieval model for
paradigmatic relation mining.
So here, we define,
in this case we define
the document vector as
containing elements representing
normalized BM25 values.
So in this normalization function,
we take a sum over all the words.
And we normalize the weight
of each word by the sum of
the weights of all the words.
And this is to, again, ensure all
the xi's will sum to 1 in this vector.
So this would be very similar
to what we had before,
in that this vector is actually something
similar to a word distribution.
All the xi's will sum to 1.
Now the weight of BM25 for
each word is defined here.
And if you compare this with our old
definition, where we just had a normalized
count, we only had this count and
the document length, the total count of words
in that context document,
and that's what we had before.
But now with the BM25 transformation,
we introduced something else.
First, of course, this extra occurrence
of this count is just to achieve
the sublinear normalization.
But we also see we introduced
the parameter k here.
And this parameter is generally a non-negative
number, although zero is also possible.
This controls the upper bound and
also
to what extent it simulates
the linear transformation.
And so this is one parameter, but we also
see there is another parameter here, b.
And this would be within 0 and 1.
And this is a parameter to
control length normalization.
And in this case, the normalization
formula has average document length here.
And this is computed by
taking the average of
the lengths of all the documents
in the collection.
In this case, all the lengths
of all the context documents.
That we are considering.
So this average document length will be
a constant for any given collection.
So it actually is only
affecting the factor of
the parameter b here
because this is a constant.
But I kept it here because it's
a constant that's useful
in retrieval, where it would give us a
stabilized interpretation of parameter b.
But, for
our purpose it would be a constant.
So it would only be affecting the length
normalization together with parameter b.
Now with this definition then, we have a
new way to define our document vectors.
And we can compute the vector
d2 in the same way.
The difference is that the high
frequency terms will now have a somewhat
lower weight.
And this would help us control the
influence of these high frequency terms.
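Here is a minimal Python sketch of such a context vector. It
assumes the standard BM25 term-frequency form with length
normalization, (k + 1) c / (c + k (1 - b + b |d| / avdl)), which
should correspond to the formula on the slide but is not spelled
out in the transcript. The function name, the default values of k
and b, and the counts are hypothetical.

```python
def bm25_vector(counts, avg_len, k=1.2, b=0.75):
    """Map raw counts in one context document to normalized BM25 weights.

    counts:  dict word -> raw count c(w, d) in the context document
    avg_len: average length of all context documents (avdl)
    k, b:    BM25 parameters (assumed defaults, k >= 0 and 0 <= b <= 1)
    """
    doc_len = sum(counts.values())
    norm = k * (1 - b + b * doc_len / avg_len)  # length-normalized k
    weights = {w: (k + 1) * c / (c + norm)
               for w, c in counts.items() if c > 0}
    total = sum(weights.values())
    # Divide by the sum so that all the x_i's sum to 1.
    return {w: v / total for w, v in weights.items()} if total else {}

# Hypothetical context of cat with one very frequent word.
cat_context = {"the": 50, "eats": 5, "fish": 5}
x = bm25_vector(cat_context, avg_len=60.0)
print(x)  # the still has the largest weight, but far below 50/60
```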
Now, the IDF can be added
here in the scoring function.
That means we will introduce a weight for
matching each term.
You may recall, this is the sum that indicates
all the possible words that can be
overlapped between the two contexts.
And the xi and the yi are probabilities
of picking the word from both contexts,
therefore,
it indicates how likely we'll
see a match on this word.
Now, IDF would give us the importance
of matching this word.
A common word will be worth
less than a rare word, and so
we emphasize more on
matching rare words now.
So, with this modification,
the new function
will likely address those two problems.
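The modified scoring function, a sum over words of IDF times xi
times yi, can be sketched as below. The function name, the two
context vectors, and the IDF values are made-up numbers used only
to show the effect of the weighting.

```python
def idf_weighted_similarity(x, y, idf):
    """sim(d1, d2) = sum over shared words of IDF(w) * x_w * y_w.

    x, y: dicts word -> normalized weight in each context vector
    idf:  dict word -> IDF value (assumed to cover all shared words)
    """
    return sum(idf[w] * x[w] * y[w] for w in x.keys() & y.keys())

# Hypothetical normalized context vectors and IDF values.
x = {"the": 0.5, "eats": 0.5}
y = {"the": 0.5, "eats": 0.5}
idf = {"the": 0.01, "eats": 2.0}

print(idf_weighted_similarity(x, y, idf))
# Matching the contributes 0.01 * 0.25, matching eats 2.0 * 0.25.
```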
Now interestingly,
we can also use this approach to
discover syntagmatical relations.
In general,
when we represent
a context with a term
vector we would likely see
some terms have higher weights, and
other terms have lower weights.
Depending on how we assign
weights to these terms,
we might be able to use
these weights to discover
the words that are strongly associated
with a candidate word in the context.
It's interesting that we can
also use this context
similarity function based on BM25
to discover syntagmatic relations.
So, the idea is to use the converted
representation of the context
to see which terms are scored high.
And if a term has high weight,
then that term might be more strongly
related to the candidate word.
So let's take a look at
the vector in more detail here.
And we have
each Xi defined as
a normalized weight of BM25.
Now this weight alone only reflects how
frequently the word occurs in the context.
But we can't just say any frequent
term in the context would be
correlated with the candidate word,
because many common words like the will
occur frequently in all the contexts.
But if we apply IDF
weighting as you see here,
we can then reweight
these terms based on IDF.
That means the words that are common,
like the, will get penalized.
So now the highest weighted terms will not
be those common terms because they have
lower IDFs.
Instead, those terms would be the terms
that are frequent in the context but
not frequent in the collection.
So those are clearly the words
that tend to occur in the context
of the candidate word, for example, cat.
So, for this reason, the highly weighted
terms in this IDF-weighted vector
can also be assumed to be candidates for
syntagmatic relations.
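As a sketch of this byproduct, the snippet below ranks the words
in one context vector by their IDF-weighted value. The function
name and all the numbers are hypothetical.

```python
def syntagmatic_candidates(x, idf, top_n=2):
    """Rank the words of a context vector by IDF(w) * x_w, highest first."""
    scored = {w: idf[w] * weight for w, weight in x.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_n]

# Hypothetical normalized BM25 weights for the context of cat.
cat_vector = {"the": 0.4, "eats": 0.3, "fish": 0.2, "is": 0.1}
# Hypothetical IDF values: common words get values close to zero.
idf = {"the": 0.01, "eats": 2.0, "fish": 2.5, "is": 0.05}

print(syntagmatic_candidates(cat_vector, idf))
```

Although the has the largest weight in the raw vector, its low IDF
pushes it to the bottom of the ranking.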
Now, of course, this is only
a byproduct of our approach for
discovering paradigmatic relations.
And in the next lecture,
we're going to talk more about how
to discover syntagmatic relations.
But it clearly shows the relation
between discovering the two relations.
And indeed they can be
discovered in a joint
manner by leveraging
such associations.
We may be able to leverage the relation to
do joint discovery of
the two kinds of relations.
This also shows some interesting
connections between the discovery of
syntagmatic relation and
the paradigmatic relation.
Specifically, those words that
are paradigmatically related tend to
have a syntagmatic
relation with the same word.
So to summarize, the main idea of
discovering paradigmatic relations
is to collect the context of a candidate
word to form a pseudo document,
and this is typically
represented as a bag of words.
And then compute similarity of
the corresponding context documents
of two candidate words.
And then we can take the highly
similar word pairs and
treat them as having
paradigmatic relations.
These are the words that
share similar contexts.
There are many different ways
to implement this general idea,
and we just talked about
some of the approaches, and
more specifically we talked about
using text retrieval models to help
us design effective similarity function
to compute the paradigmatic relations.
More specifically we
have used the BM25 and
IDF weighting to discover
paradigmatic relation.
And these approaches also
represent the state of the art
in text retrieval techniques.
Finally, syntagmatic relations
can also be discovered as a
byproduct when we discover
paradigmatic relations.
[MUSIC]

[SOUND]
This
lecture is about topic mining and
analysis.
We're going to talk about its
motivation and task definition.
In this lecture we're going to talk
about a different kind of mining task.
As you see on this road map,
we have just covered
mining knowledge about language,
namely discovery of
word associations such as paradigmatic
relations and syntagmatic relations.
Now, starting from this lecture, we're
going to talk about mining another kind of
knowledge, which is content mining, and
trying to discover knowledge about
the main topics in the text.
And we call that topic mining and
analysis.
In this lecture, we're going to talk about
its motivation and the task definition.
So first of all,
let's look at the concept of topic.
So topic is something that we
all understand, I think, but
it's actually not that
easy to formally define.
Roughly speaking, topic is the main
idea discussed in text data.
And you can think of this as a theme or
subject of a discussion or conversation.
It can also have different granularities.
For example,
we can talk about the topic of a sentence.
the topic of an article,
the topic of a paragraph, or
the topic of all the research articles
in the research library, right,
so different granularities of topics
obviously have different applications.
Indeed, there are many applications that
require discovery of topics in text, and
then analysis of them.
Here are some examples.
For example, we might be interested
in knowing what Twitter
users are talking about today.
Are they talking about NBA sports, or
are they talking about some
international events, etc.?
Or we are interested in
knowing about research topics.
For example, one might be interested in
knowing what are the current research
topics in data mining, and how are they
different from those five years ago?
Now this involves discovery of topics
in data mining literatures and
also we want to discover topics in
today's literature and those in the past.
And then we can make a comparison.
We might also be interested in
knowing what do people like about
some products like the iPhone 6,
and what do they dislike?
And this involves discovering
topics in positive opinions about
iPhone 6 and
also negative reviews about it.
Or perhaps we're interested in knowing
what are the major topics debated in the 2012
presidential election?
And all these have to do with discovering
topics in text and analyzing them,
and we're going to talk about a lot
of techniques for doing this.
In general we can view a topic as
some knowledge about the world.
So from text data we expect to
discover a number of topics, and
then these topics generally provide
a description about the world.
And they tell us something about the world,
about a product, about a person, etc.
Now when we have some non-text data,
then we can have more context for
analyzing the topics.
For example, we might know the time
associated with the text data, or
locations where the text
data were produced,
or the authors of the text, or
the sources of the text, etc.
All such meta data, or
context variables can be associated
with the topics that we discover, and
then we can use these context variables
to help us analyze patterns of topics.
For example, looking at topics over time,
we would be able to discover
whether there's a trending topic, or
some topics might be fading away.
Similarly, looking at topics
in different locations,
we might gain some insights about
people's opinions in different locations.
So that's why mining
topics is very important.
Now, let's look at the tasks
of topic mining and analysis.
In general, it would involve first
discovering a lot of topics, in this case,
k topics.
And then we also would like to know, which
topics are covered in which documents,
to what extent.
So for example, in document one, we
might see that Topic 1 is covered a lot,
Topic 2 and
Topic k are covered with a small portion.
And other topics,
perhaps, are not covered.
Document two, on the other hand,
covered Topic 2 very well,
but it did not cover Topic 1 at all, and
it also covers Topic k to some extent,
etc., right?
So now you can see there
are generally two different tasks, or
sub-tasks. The first is to discover k
topics from a collection of text data.
What are these k topics,
the k major topics in the text data?
The second task is to figure out
which documents cover which topics
to what extent.
So more formally,
we can define the problem as follows.
First, we have, as input,
a collection of N text documents.
Here we can denote the text
collection as C, and
denote a text article as d i.
And, we generally also need to have
as input the number of topics, k.
But there may be techniques that can
automatically suggest a number of topics.
But in the techniques that we will
discuss, which are also the most useful
techniques, we often need to
specify a number of topics.
Now the output would then be the k
topics that we would like to discover,
denoted as theta sub
one through theta sub k.
Also we want to generate the coverage of
topics in each document d sub i. And
this is denoted by pi sub i j.
And pi sub ij is the probability
of document d sub i
covering topic theta sub j.
So obviously for each document, we have
a set of such values to indicate to
what extent the document covers
each topic.
And we can assume that these
probabilities sum to one.
Because a document won't be able to cover
other topics outside of the topics
that we discussed, that we discovered.
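The coverage output can be pictured as one row of k values per
document. The sketch below uses made-up numbers for k = 3 that
mirror the two example documents described earlier; the variable
and function names are assumptions.

```python
# coverage[d][j] plays the role of pi sub i j: the probability that
# document d covers topic j. All values here are hypothetical.
coverage = {
    "d1": [0.80, 0.10, 0.10],  # mostly Topic 1, a little of the other two
    "d2": [0.00, 0.70, 0.30],  # no Topic 1, mostly Topic 2, some Topic k
}

def is_valid_coverage(pi, tol=1e-9):
    """A coverage row must be non-negative and sum to one."""
    return all(p >= 0 for p in pi) and abs(sum(pi) - 1.0) <= tol

for doc, pi in coverage.items():
    print(doc, pi, is_valid_coverage(pi))
```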
So now, the question is, how do we define
theta sub i, how do we define the topic?
Now this problem has not
been completely defined
until we define what is exactly theta.
So in the next few lectures,
we're going to talk about
different ways to define theta.
[MUSIC]

[MUSIC]
This lecture is about the expectation
maximization algorithms,
also called the EM algorithms.
In this lecture,
we're going to continue the discussion
of probabilistic topic models.
In particular,
we're going to introduce the EM algorithm.
Which is a family of useful algorithms for
computing the maximum likelihood
estimate of mixture models.
So, this is now a familiar scenario
of using a two-component mixture
model to try to factor out the background
words from one topic word distribution.
Yeah.
So, we're interested in computing
this estimate and
we're going to try to adjust these
probability values to maximize
the probability of the observed documents.
And note that we assume all
the other parameters are known.
So, the only thing unknown is these word
probabilities, given by theta sub d.
And in this lecture, we're going to look
into how to compute this maximum likelihood
estimate.
Now this started with the idea of
separating the words in
the text data into two groups.
One group will be explained
by the background model.
The other group will be explained
by the unknown topic word distribution.
After all this is the basic
idea of the mixture model.
But, suppose we actually know which
word is from which distribution.
So that would mean, for example,
these words, the, is, and
we, are known to be from this
background word distribution.
On the other hand,
the other words, text, mining,
clustering, etcetera are known to be
from the topic word distribution.
If you can see the color,
these are shown in blue.
These blue words are assumed
to be from the topic word distribution.
If we already know how
to separate these words,
then the problem of estimating
the word distribution
would be extremely simple, right?
If you think about this for
a moment, you'll realize that, well,
we can simply take all these
words that are known to be from
this word distribution,
theta sub d, and normalize their counts.
So indeed this problem would be
very easy to solve if we had known
which words are from which
distribution precisely.
And this, in fact,
makes this model no longer a mixture
model, because we can already observe which
of these distributions has been used
to generate which part of the data.
So we actually go back to the single
word distribution problem.
And in this case, let's call these words
that are known to be from theta d,
a pseudo document of d prime.
And now all we have to do is
just normalize these word
counts for each word, w sub i.
And that's fairly straightforward,
and it's just dictated by
the maximum likelihood estimator.
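As a minimal sketch of this normalization, assuming we really were told which words came from theta sub d: the function name and the tiny pseudo document below are hypothetical, made up only for illustration.

```python
from collections import Counter

def mle_word_distribution(words):
    """Maximum likelihood estimate: each word's count divided by the total count."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Hypothetical pseudo document d': words assumed known to come from theta_d.
pseudo_doc = ["text", "mining", "text", "clustering"]
p_topic = mle_word_distribution(pseudo_doc)
print(p_topic)  # "text" gets 2/4, the other two words 1/4 each
```

The estimate is just relative frequency, which is why the problem becomes easy once the word assignments are known.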
Now, this idea, however,
doesn't work because in practice we
don't really know which word
is from which distribution.
But this gives us an idea of perhaps
we can guess which word is
from which distribution.
Specifically, given all the parameters,
can we infer the distribution
a word is from?
So let's assume that we actually
know tentative probabilities for
these words in theta sub d.
So now all the parameters are known for
this mixture model.
Now let's consider a word like text.
So the question is,
do you think text is more likely
to have been generated from theta sub d or
from theta sub B?
So, in other words,
we are to infer which distribution
has been used to generate this text.
Now, this inference process is a typical
Bayesian inference situation,
where we have some prior about
these two distributions.
So can you see what is our prior here?
Well, the prior here is the probability
of each distribution, right.
So the prior is given by
these two probabilities.
In this case, the prior is saying
that each model is equally likely.
But we can imagine perhaps
a different prior is possible.
So this is called a prior
because this is our guess
of which distribution has been
used to generate the word
before we even observe the word.
So that's why we call it a prior.
If we don't observe the word we don't
know what word has been observed.
Our best guess is to say,
well, they're equally likely.
So it's just like flipping a coin.
Now in Bayesian inference,
we typically update our belief
after we have observed the evidence.
So what is the evidence here?
Well, the evidence here is the word text.
Now that we know we're
interested in the word text.
So text can be regarded as evidence.
And if we use Bayes'
rule to combine the prior and
the data likelihood,
what we end up doing
is to combine the prior
with the likelihood that you see here,
which is basically the probability of
the word text from each distribution.
And we see that in both
cases text is possible.
Note that even in the background
it is still possible,
it just has a very small probability.
So intuitively what would be
your guess in this case?
Now if you're like many others,
you would guess text is probably
from theta sub d. It's more likely from theta sub d.
Why?
And you will probably see
that it's because text has
a much higher probability
here by theta sub d than
by the background model, which
has a very small probability.
And by this we're going to say, well,
text is more likely from theta sub d.
So you see our guess of which
distribution has been used to
generate text would depend on
how high the probability of the data,
the text, is in each word distribution.
We tend to guess the
distribution that gives the word
a higher probability.
And this is likely to
maximize the likelihood.
All right, so we are going to choose
the distribution that gives the word a higher likelihood.
So, in other words we are going to
compare these two probabilities
of the word given by each
of these distributions.
But our guess must also
be affected by the prior.
So we also need to
compare these two priors.
Why?
Because imagine if we
adjust these probabilities.
We're going to say,
the probability of choosing
a background model is almost 100%.
Now if we have that kind of strong prior,
then that would affect your guess.
You might think,
well, wait a moment, maybe text could
have been from the background as well.
Although the probability is very
small here, the prior is very high.
So in the end, we have to combine the two.
And the Bayes formula
provides us a solid and
principled way of making this
kind of guess and quantifying it.
So more specifically, let's think about
the probability that this word text
has been generated in
fact from theta sub d.
Well, in order for text to be generated
from theta sub d, two things must happen.
First, the theta sub d
must have been selected.
So, we have the selection
probability here.
And secondly, we also have to actually have
observed text from the distribution.
So, when we multiply the two together,
we get the probability
that text has in fact been
generated from theta sub d.
Similarly, for the background model and
the probability of generating text
is another product of similar form.
Now we also introduce a latent
variable z here to denote
whether the word is from the background or
the topic.
When z is 0, it means it's from the topic,
theta sub d.
When it's 1, it means it's from
the background, theta sub B.
So now that we have the probability
that text is generated from each,
we can simply normalize
them to have an estimate
of the probability that
the word text is from
theta sub d or from theta sub B.
And equivalently the probability
that Z is equal to zero,
given that the observed evidence is text.
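To make the normalization concrete, here is a small sketch of the posterior probability that z is equal to zero given a word. The function name and all the probability values are hypothetical, chosen only to show how the prior and the likelihood interact.

```python
def posterior_topic(p_w_topic, p_w_background, prior_topic=0.5):
    """p(z=0 | w): probability that the word came from theta_d, by Bayes' rule."""
    joint_topic = prior_topic * p_w_topic                    # theta_d selected, then w generated
    joint_background = (1.0 - prior_topic) * p_w_background  # theta_B selected, then w generated
    return joint_topic / (joint_topic + joint_background)

# Hypothetical values: "text" is likely under the topic and rare in the background.
p_text_topic, p_text_background = 0.04, 0.000006

equal_prior = posterior_topic(p_text_topic, p_text_background, prior_topic=0.5)
strong_background_prior = posterior_topic(p_text_topic, p_text_background, prior_topic=0.0001)
print(equal_prior, strong_background_prior)
```

With equal priors the topic wins clearly. With a prior that puts almost all the weight on the background, the same word becomes more likely to be attributed to the background, which is the effect of the prior described above.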
So this is an application of Bayes' rule.
But this step is very crucial for
understanding the EM algorithm.
Because if we can do this,
then we would be able to first,
initialize the parameter
values somewhat randomly.
And then, we're going to take
a guess of these z values,
that is, which distribution has been
used to generate which word.
And the initialized parameter values
would allow us to have a complete
specification of the mixture model,
which allows us to apply Bayes'
rule to infer which distribution is
more likely to generate each word.
And this prediction essentially helps us
to separate words from
the two distributions.
Although we can't separate them for sure,
we can separate them
probabilistically as shown here.
[MUSIC]

[SOUND]
So
this is indeed a general idea of
the Expectation-Maximization, or EM,
Algorithm.
So in all the EM algorithms we
introduce a hidden variable
to help us solve the problem more easily.
In our case the hidden variable
is a binary variable for
each occurrence of a word.
And this binary variable would
indicate whether the word has
been generated from theta sub d or theta sub B.
And here we show some possible
values of these variables.
For example, for the, it's from the background,
so the z value is one.
And text, on the other hand,
is from the topic, so it's zero for
z, etc.
Now, of course, we don't observe these z
values, we just imagine there are such
values of z attached to all the words.
And that's why we call
these hidden variables.
Now, the idea that we
talked about before for
predicting the word distribution that
has been used to generate the word
is to predict
the value of this hidden variable.
And, so, the EM algorithm then,
would work as follows.
First, we'll initialize all
the parameters with random values.
In our case,
the parameters are mainly the probability
of a word, given by theta sub d.
So this is the initialization stage.
These initialized values would allow
us to use Bayes' rule to take a guess
of these z values, so
we'd guess these values.
We can't say for sure whether
text is from the background or not.
But we can have our guess.
This is given by this formula.
It's called an E-step.
And so the algorithm would then try to
use the E-step to guess these z values.
After that, it would then invoke
another step called the M-step.
In this step we simply take advantage
of the inferred z values and
then just group words that are from
the same distribution, like these
from the background, including this as well.
We can then normalize the count
to estimate the probabilities or
to revise our estimate of the parameters.
So let me also illustrate
that we can group the words
that are believed to have
come from theta sub d, and
that's text, mining, algorithm,
for example, and clustering.
And we group them together to help us
re-estimate the parameters
that we're interested in.
So these will help us
estimate these parameters.
Note that before we just set
these parameter values randomly.
But with this guess, we will have
a somewhat improved estimate of this.
Of course, we don't know exactly
whether it's zero or one.
So we're not going to really
do the split in a hard way.
But rather we're going to
do a softer split.
And this is what happens here.
So we're going to adjust the count by
the probability that we believe
this word has been generated
by using theta sub d.
And you can see this,
where does this come from?
Well, this has come from here, right?
From the E-step.
So the EM algorithm would
iteratively improve our initial
estimate of parameters by using
the E-step first and then the M-step.
The E-step is to augment the data
with additional information, like z.
And the M-step is to take advantage
of the additional information
to separate the data,
to split the counts, and
then collect the right counts to
re-estimate our parameters.
And then once we have a new generation of
parameters, we're going to repeat this.
We are going to do the E-step again
to improve our estimate
of the hidden variables.
And then that would lead to another
generation of re-estimated parameters
for the word distribution
that we are interested in.
Okay, so, as I said,
the bridge between the two
is really the hidden variable z,
which indicates how likely
this word is from the topic word
distribution, theta sub d.
So, this slide has a lot of content and
you may need to
pause the video to digest it.
But this basically captures
the essence of EM Algorithm.
Start with initial values that
are often random themselves.
And then we invoke the E-step followed
by the M-step to get an improved
setting of parameters.
And then we repeat this, so
this is a hill-climbing algorithm
that would gradually improve
the estimate of parameters.
As I will explain later
there is some guarantee for
reaching a local maximum of
the log-likelihood function.
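For reference, the two steps for the two-component mixture can be written as below. The slide formulas are not in the transcript, so this is a reconstruction from the spoken description: the superscript (n) marks the generation of parameters, c(w,d) is the count of word w in the document, and V is the vocabulary.

```latex
\text{E-step:}\quad
p^{(n)}(z = 0 \mid w) =
\frac{p(\theta_d)\, p^{(n)}(w \mid \theta_d)}
     {p(\theta_d)\, p^{(n)}(w \mid \theta_d) + p(\theta_B)\, p(w \mid \theta_B)}

\text{M-step:}\quad
p^{(n+1)}(w \mid \theta_d) =
\frac{c(w, d)\, p^{(n)}(z = 0 \mid w)}
     {\sum_{w' \in V} c(w', d)\, p^{(n)}(z = 0 \mid w')}
```

The E-step is Bayes' rule applied to each word, and the M-step normalizes the counts after each count has been scaled by the probability that the word came from the topic.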
So let's take a look at the computation for
a specific case. So
these formulas are the EM
formulas that you saw before, and
you can also see there are superscripts
here, like here, n,
to indicate the generation of parameters.
Like here, for example, we have n plus one.
That means we have improved.
From here to here we have an improvement.
So in this setting we have assumed the two
models have equal probabilities and
the background model is known.
So what are the relevant
statistics?
Well these are the word counts.
So assume we have just four words,
and their counts are like this.
And this is our background model that
assigns high probabilities to common
words like the.
And in the first iteration,
you can picture what will happen.
Well first we initialize all the values.
So here, this probability that we're
interested in is initialized to a uniform
distribution over all the words.
And then the E-step would give us a guess
of the distribution that has been used
to generate each word.
We can see we have different
probabilities for different words.
Why?
Well, that's because these words have
different probabilities in the background.
So even though the two
distributions are equally likely,
and our initialization is a uniform
distribution, because of the difference
in the background word distribution,
we have different guessed probabilities.
So these words are believed to
be more likely from the topic.
These, on the other hand, are less likely,
probably from the background.
So once we have these z values,
we know in the M-step these probabilities
will be used to adjust the counts.
So four must be multiplied by this 0.33
in order to get the allocated
counts toward the topic.
And this is done by this multiplication.
Note that if our guess says this
is 100%, if this is one point zero,
then we just get the full count
of this word for this topic.
In general it's not going
to be one point zero.
So we're just going to get some percentage
of these counts toward this topic.
Then we simply normalize these counts
to have a new generation
of parameter estimates.
So you can see, compare this with
the older one, which is here.
So compare this with this one and
we'll see the probability is different.
Not only that, we also see some
words that are believed to have come from
the topic will have a higher probability.
Like this one, text.
And of course, this new generation of
parameters would allow us to further
adjust the inferred latent variable or
hidden variable values.
So we have a new generation of values,
because of the E-step based on
the new generation of parameters.
And these new inferred values
of Zs will give us then
another generation of the estimate
of probabilities of the word.
And so on and so forth so this is what
would actually happen when we compute
these probabilities
using the EM Algorithm.
As you can see in the last row
where we show the log-likelihood,
the likelihood is increasing
as we do the iteration.
And note that the log-likelihood is
negative because the probability is
between 0 and 1, so when you take a logarithm,
it becomes a negative value.
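The iteration just described can be reproduced with a short script. The word counts and background probabilities below are assumptions, since the slide's table is not in the transcript; they were chosen so that the first E-step gives about 0.33 for "the", matching the number mentioned above.

```python
import math

# Assumed toy data (not from the transcript): four words with their counts,
# and a known background model that favors common words like "the".
counts = {"the": 4, "paper": 2, "text": 4, "mining": 2}
p_bg = {"the": 0.5, "paper": 0.3, "text": 0.1, "mining": 0.1}
prior_topic = 0.5  # the two component models are equally likely

def e_step(p_topic):
    """p(z=0 | w) for every word, by Bayes' rule."""
    return {w: prior_topic * p_topic[w]
               / (prior_topic * p_topic[w] + (1 - prior_topic) * p_bg[w])
            for w in counts}

def m_step(z):
    """Scale each count by p(z=0 | w), then normalize over all words."""
    allocated = {w: counts[w] * z[w] for w in counts}
    total = sum(allocated.values())
    return {w: a / total for w, a in allocated.items()}

def log_likelihood(p_topic):
    return sum(c * math.log(prior_topic * p_topic[w] + (1 - prior_topic) * p_bg[w])
               for w, c in counts.items())

p_topic = {w: 1.0 / len(counts) for w in counts}  # uniform initialization
first_z = e_step(p_topic)                          # "the" gets about 0.33
history = []
for iteration in range(5):
    history.append(log_likelihood(p_topic))
    p_topic = m_step(e_step(p_topic))
    print(iteration, round(history[-1], 4), {w: round(p, 3) for w, p in p_topic.items()})
```

Under these assumed numbers, the printed log-likelihood values are negative and do not decrease from one iteration to the next, and the probability of "text" under the topic rises above its initial value of 0.25.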
Now what's also interesting is,
you'll note the last column.
And these are the inferred word splits.
And these are the probabilities
that a word is believed to
have come from one distribution, in this
case the topical distribution, all right.
And you might wonder whether
this would be also useful.
Because our main goal is to
estimate these word distributions.
So this is our primary goal.
We hope to have a more discriminative
word distribution.
But the last column is also a by-product.
This also can actually be very useful.
You can think about that.
One use, for
example, is to estimate to what extent this
document has covered background words.
And when we add this up or
take the average, we will kind of know to
what extent it has covered background
versus content words that are not
explained well by the background.
[MUSIC]

So, I just showed you that empirically
the likelihood will converge,
but theoretically it can also
be proved that the EM algorithm will
converge to a local maximum.
So here's just an illustration of what
happens; a detailed explanation
would require more knowledge about
some inequalities
that we haven't really covered yet.
So here what you see is on the x
dimension, we have a theta value.
This is a parameter that we have.
On the y axis we see
the likelihood function.
So this curve is the original
likelihood function,
and this is the one that
we hope to maximize.
And we hope to find a theta value
at this point to maximize this.
But in the case of a mixture model we can
not easily find an analytic solution
to the problem.
So, we have to resort to
numerical algorithms, and
the EM algorithm is such an algorithm.
It's a hill-climbing algorithm.
That would mean you start
with some random guess.
Let's say you start from here,
that's your starting point.
And then you try to improve
this by moving this to
another point where you can
have a higher likelihood.
So that's the idea of hill climbing.
And in the EM algorithm, the way we
achieve this is to do two things.
First, we'll fix a lower
bound of the likelihood function.
So this is the lower bound.
See here.
And once we fix the lower bound,
we can then maximize the lower bound.
And of course, the reason why this works,
is because the lower bound
is much easier to optimize.
So we know our current guess is here.
And by maximizing the lower bound,
we'll move this point to the top.
To here.
Right?
And we can then map to the original
likelihood function, we find this point.
Because it's a lower bound, we are
guaranteed to improve this guess, right?
Because we improve our lower bound and
then the original likelihood
curve which is above this lower bound
will definitely be improved as well.
So we already know it's
improving the lower bound.
So we definitely improve this
original likelihood function,
which is above this lower bound.
So, in our example,
the current guess is parameter value
given by the current generation.
And then the next guess is
the re-estimated parameter values.
From this illustration you
can see the next guess
is always better than the current guess.
Unless it has reached the maximum,
where it will be stuck there.
So the two would be equal.
So, the E-step is basically
to compute this lower bound.
We don't directly just compute
this likelihood function but
we compute the latent
variable values, and
these are basically a part
of this lower bound.
This helps determine the lower bound.
The M-step on the other hand is
to maximize the lower bound.
It allows us to move
parameters to a new point.
And that's why EM algorithm is guaranteed
to converge to a local maximum.
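For readers who want the inequality behind the picture, one standard way to write the lower bound is shown below. This is not taken from the lecture, which skips the derivation; it is the usual bound obtained from Jensen's inequality, with X the observed words, Z the hidden variables, and theta superscript (n) the current parameter values.

```latex
\log p(X \mid \theta)
  = \log \sum_{Z} p(X, Z \mid \theta)
  \;\ge\; \sum_{Z} p(Z \mid X, \theta^{(n)})
          \log \frac{p(X, Z \mid \theta)}{p(Z \mid X, \theta^{(n)})}

\text{with equality at } \theta = \theta^{(n)}
```

The E-step computes the posterior over Z that fixes the bound, and the M-step picks the theta that maximizes the right-hand side. Because the bound touches the likelihood at the current guess, raising the bound cannot lower the likelihood.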
Now, as you can imagine,
when we have many local maxima,
we also have to repeat the EM
algorithm multiple times.
In order to figure out which one
is the actual global maximum.
And this actually in general is a
difficult problem in numerical optimization.
So here, for
example, had we started from here,
then we gradually just
climb up to this top.
So, that's not optimal, and
we'd like to climb up all the way to here,
so the only way to climb up to here
is to start from somewhere here or here.
So, in the EM algorithm, we generally
would have to start from different points
or have some other way to determine
a good initial starting point.
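The effect of the starting point can be seen in a toy sketch. This is plain one-dimensional hill climbing on a made-up function with two peaks, not the EM algorithm itself; the function, step size, and starting points are all assumptions for illustration.

```python
import math

def f(x):
    # Hypothetical curve with a small hill near x = 1 and a taller hill near x = 4.
    return math.exp(-(x - 1.0) ** 2) + 2.0 * math.exp(-(x - 4.0) ** 2)

def hill_climb(x, step=0.01, max_steps=10000):
    """Move to a neighboring point while it improves f; stop at a local maximum."""
    for _ in range(max_steps):
        best = max((x - step, x, x + step), key=f)
        if best == x:
            break
        x = best
    return x

starts = [0.0, 2.0, 3.0, 5.0]              # several different starting points
results = [hill_climb(s) for s in starts]
best_x = max(results, key=f)               # keep the run with the highest value
print([round(r, 2) for r in results], round(best_x, 2))
```

Runs that start at 0.0 or 2.0 get stuck on the smaller hill, while runs that start at 3.0 or 5.0 reach the taller one, so keeping the best of several runs is what recovers the better maximum.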
To summarize in this lecture we
introduced the EM algorithm.
This is a general algorithm for computing
the maximum likelihood estimate of all
kinds of models, so
not just for our simple model.
And it's a hill-climbing algorithm, so it
can only converge to a local maximum and
it will depend on initial points.
The general idea is that we will have
two steps to improve the estimate of parameters.
In the E-step we roughly augment
the data by predicting values
of useful hidden variables that we
would use to simplify the estimation.
In our case, this is the distribution
that has been used to generate the word.
In the M-step then we would exploit
such augmented data which would make
it easier to estimate the distribution,
to improve the estimate of parameters.
Here improvement is guaranteed in
terms of the likelihood function.
Note that it's not necessary that we
will have a stable convergence of
parameter values even though the likelihood
function is ensured to increase.
There are some properties that have to
be satisfied in order for the parameters
also to converge to some stable value.
Now here data augmentation
is done probabilistically.
That means,
we're not going to just say exactly
what's the value of a hidden variable.
But we're going to have a probability
distribution over the possible values of
these hidden variables.
So this causes a split of counts
of events probabilistically.
And in our case we'll split the word
counts between the two distributions.
[MUSIC]

[SOUND]
This
lecture is about probabilistic
latent semantic analysis, or PLSA.
In this lecture we're going to introduce
probabilistic latent semantic analysis,
often called PLSA.
This is the most basic topic model,
also one of the most useful topic models.
Now this kind of model
can in general be used to
mine multiple topics from text documents.
And PLSA is one of the most basic
topic models for doing this.
So let's first examine this problem
in a little more detail.
Here I show a sample article which is
a blog article about Hurricane Katrina.
And I show some sample topics.
For example, government response,
flooding of the city of New Orleans,
donation, and the background.
You can see in the article we use
words from all these distributions.
So we first for example see there's
a criticism of government response and
this is followed by discussion of flooding
of the city and donation et cetera.
We also see background
words mixed with them.
So the overall goal of topic analysis here
is to try to decode these topics behind
the text, to segment the topics,
to figure out which words are from which
distribution, and to figure out first,
what are these topics?
How do we know there's a topic
about government response?
There's a topic about a flood in the city.
So these are the tasks
of a topic model.
If we have discovered these
topics, we can color these words,
as you see here,
to separate the different topics.
Then you can do a lot of things,
such as summarization, or segmentation,
of the topics,
clustering of the sentences etc.
So the formal definition of the problem of
mining multiple topics from text is
shown here.
And this is actually a slide that you
have seen in an earlier lecture.
So the input is a collection, the number
of topics, and a vocabulary set, and
of course the text data.
And then the output is of two kinds.
One is the topic
characterization,
theta i's.
Each theta i is a word distribution.
And second, it's the topic coverage for
each document.
These are pi sub i j's.
And they tell us which document covers
which topic to what extent.
So we hope to generate these as output.
Because there are many useful
applications if we can do that.
So the idea of PLSA is
actually very similar to
the two component mixture model
that we have already introduced.
The only difference is that we
are going to have more than two topics.
Otherwise, it is essentially the same.
So here I illustrate how we can generate
the text that has multiple topics and
naturally, as in all cases
of probabilistic modelling, we would want
to figure out the likelihood function.
So we would also ask the question,
what's the probability of observing
a word from such a mixture model?
Now if you look at this picture and
compare this with the picture
that we have seen earlier,
you will see the only difference is
that we have added more topics here.
So, before we had just one topic,
besides the background topic.
But now we have more topics.
Specifically, we have k topics now.
All these are topics that we assume
exist in the text data.
So the consequence is that our switch for
choosing a topic is now a multiway switch.
Before it was just a two-way switch.
We can think of it as flipping a coin.
But now we have multiple ways.
First we can flip a coin to decide
whether we talk about the background.
So it's the background lambda
sub B versus non-background.
1 minus lambda sub B gives
us the probability of
actually choosing a non-background topic.
After we have made this decision,
we have to make another decision to
choose one of these K distributions.
So there is a k-way switch here.
And this is characterized by pi,
and these sum to one.
This is just a difference in design,
which is a little bit more complicated.
But once we decide which distribution to
use, the rest is the same: we are going to
just generate a word by using one of
these distributions as shown here.
So now let's look at the question
about the likelihood.
So what's the probability of observing
a word from such a distribution?
What do you think?
Now we've seen this
problem many times now and
if you can recall, it's generally a sum
of all the different possibilities
of generating a word.
So let's first look at how the word can
be generated from the background model.
Well, the probability that the word is
generated from the background model
is lambda sub B multiplied by the probability
of the word from the background
model, right.
Two things must happen.
First, we have to have
chosen the background model,
and that's the probability
lambda sub B.
Then second, we must have actually
obtained the word w from the background,
and that's probability
of w given theta sub b.
Okay, so similarly,
we can figure out the probability of
observing the word from another topic.
Like the topic theta sub k.
Now notice that here's
the product of three terms.
And that's because the choice
of topic theta sub k
only happens if two things happen.
One is we decide not to
talk about background.
So, that's a probability
of 1 minus lambda sub B.
Second, we also have to actually choose
theta sub k among these k topics.
So that's the probability of theta sub k,
or pi.
And similarly, the probability of
generating a word from the second
topic and the first topic
are like what you are seeing here.
And so
in the end the probability of observing
the word is just a sum of all these cases.
And I have to stress again this is a very
important formula to know because this is
really key to understanding all the topic
models and indeed a lot of mixture models.
So make sure that you really
understand the probability
of w is indeed the sum of these terms.
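The sum over all the ways of generating a word can be sketched in a few lines. The function name, the three-word vocabulary, and every probability value below are hypothetical; only the structure of the sum follows the lecture.

```python
def word_probability(w, lambda_b, p_bg, pi_d, topics):
    """p(w) in document d: background term plus the weighted sum over the k topics."""
    topic_part = sum(pi_dj * topic[w] for pi_dj, topic in zip(pi_d, topics))
    return lambda_b * p_bg[w] + (1.0 - lambda_b) * topic_part

# Hypothetical model with k = 2 topics over a three-word vocabulary.
p_bg = {"the": 0.8, "flood": 0.1, "donate": 0.1}
topics = [
    {"the": 0.1, "flood": 0.8, "donate": 0.1},  # theta_1
    {"the": 0.1, "flood": 0.1, "donate": 0.8},  # theta_2
]
pi_d = [0.7, 0.3]   # coverage of the two topics in this document
lambda_b = 0.5      # probability of choosing the background

p_flood = word_probability("flood", lambda_b, p_bg, pi_d, topics)
total = sum(word_probability(w, lambda_b, p_bg, pi_d, topics) for w in p_bg)
print(p_flood, total)
```

Each topic term is a product of three factors (not background, topic chosen, word generated), and the probabilities over the whole vocabulary still sum to one because every component distribution does.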
So, next,
once we have the likelihood function,
we would be interested in
knowing the parameters.
All right, so to estimate the parameters.
But firstly,
let's put all these together to have the
complete likelihood function for PLSA.
The first line shows the probability of a
word as illustrated on the previous slide.
And this is an important
formula as I said.
So let's take a closer look at this.
This actually contains all
the important parameters.
So first of all we see lambda sub b here.
This represents a percentage
of background words
that we believe exist in the text data.
And this can be a known value
that we set empirically.
Second, we see the background
language model, and
typically we also assume this is known.
We can use a large collection of text, or
use all the text that we have available,
to estimate the word distribution.
Now next, in the rest of this formula,
you see two interesting
kinds of parameters;
those are the most important parameters
that we are interested in.
So one is the pi's.
And these are the coverage
of a topic in the document.
And the other is the word distributions
that characterize all the topics.
So the next line,
then is simply to plug this
in to calculate
the probability of a document.
This is, again, of the familiar
form where you have a sum and
you have a count of
a word in the document.
And then log of a probability.
Now it's a little bit more
complicated than the two-component case,
because now we have more components,
so the sum involves more terms.
And then this line is just
the likelihood for the whole collection.
And it's very similar, just accounting for
more documents in the collection.
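Written out, the three lines described here take the following form. The slide is not in the transcript, so this is a reconstruction from the spoken description; c(w,d) is the count of word w in document d, V is the vocabulary, and pi sub d,j is the coverage of topic j in document d.

```latex
p_d(w) = \lambda_B\, p(w \mid \theta_B)
       + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j)

\log p(d) = \sum_{w \in V} c(w, d) \log p_d(w)

\log p(C) = \sum_{d \in C} \sum_{w \in V} c(w, d) \log p_d(w)
```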
So what are the unknown parameters?
I already said that there are two kinds.
One is coverage,
one is word distributions.
Again, it's a useful exercise for
you to think about
exactly how many
parameters there are here.
How many unknown parameters are there?
Now, trying to
think through that question will help you
understand the model in more detail.
And it will also allow you to understand
what would be the output that we generate
when we use PLSA to analyze text data.
And these are precisely
the unknown parameters.
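One way to check your answer to this exercise is to count directly, assuming lambda sub B and the background model are fixed as described. The function name and the example sizes are hypothetical.

```python
def plsa_parameter_counts(n_docs, n_topics, vocab_size):
    """Count the unknown PLSA parameters, with lambda_B and theta_B assumed known."""
    coverage = n_docs * n_topics        # one pi per (document, topic) pair
    word_probs = n_topics * vocab_size  # one p(w | theta_j) per (topic, word) pair
    # Each distribution must sum to one, so it has one fewer free value.
    free = n_docs * (n_topics - 1) + n_topics * (vocab_size - 1)
    return {"coverage": coverage, "word_probs": word_probs, "free": free}

example = plsa_parameter_counts(n_docs=100, n_topics=10, vocab_size=5000)
print(example)  # {'coverage': 1000, 'word_probs': 50000, 'free': 50890}
```

The word distributions dominate the count unless the collection is very large relative to the vocabulary.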
So after we have obtained
the likelihood function shown here,
the next is to worry about
the parameter estimation.
And we can do the usual thing,
maximum likelihood estimation.
So again, it's a constrained optimization
problem, like what we have seen before.
Only that we have a collection of text and
we have more parameters to estimate.
And we still have two constraints,
two kinds of constraints.
One is the word distributions.
All the words must have probabilities
that sum to one for one distribution.
The other is the topic
coverage distribution:
a document will have to cover
precisely these k topics, so
the probabilities of covering each
topic would have to sum to 1.
So at this point, though, it's basically
a well-defined applied math problem;
you just need to figure out
the solution to the optimization problem.
There's a function with many variables,
and we need to just figure
out the values of these
variables to make the function
reach its maximum.
[MUSIC]

[SOUND]
We can compute this maximum likelihood estimate
by using the EM algorithm.
So in the E-step,
we now have to introduce more hidden
variables because we have more topics,
so our hidden variable z now,
which is a topic indicator, can
take more than two values.
So specifically it will
take k plus one values,
with B denoting the background,
and one through k
to denote the other k topics, right.
So, now the E-step, as you can
recall, is to augment the data
by predicting the values
of the hidden variable.
So we're going to predict for
a word, whether the word has come from
one of these k plus one distributions.
This equation allows us to
predict the probability
that the word w in document d is
generated from topic theta sub j.
And the bottom one is
the predicted probability that this
word has been generated
from the background.
Note that we use document
d here to index the word.
Why?
Because whether a word is
from a particular topic
actually depends on the document.
Can you see why?
Well, it's through the pi's.
The pi's are tied to each document.
Each document can have potentially
different pi's, right.
The pi's will then affect our prediction.
So, the pi's are here.
And this depends on the document.
And that might give a different guess for
a word in different documents,
and that's desirable.
In both cases we are using
Bayes' rule, as I explained, basically
assessing the likelihood of generating the
word from each of these distributions and
then normalizing.
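The two E-step equations referred to here are not in the transcript. One common way to write them, assumed to match the slide, is below: the first is the probability of topic j given that the word is not from the background, and the second is the probability that the word is from the background.

```latex
p(z_{d,w} = j) =
\frac{\pi_{d,j}^{(n)}\, p^{(n)}(w \mid \theta_j)}
     {\sum_{j'=1}^{k} \pi_{d,j'}^{(n)}\, p^{(n)}(w \mid \theta_{j'})}

p(z_{d,w} = B) =
\frac{\lambda_B\, p(w \mid \theta_B)}
     {\lambda_B\, p(w \mid \theta_B)
      + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j}^{(n)}\, p^{(n)}(w \mid \theta_j)}
```

The document index d appears on z because the pi's differ from document to document, so the same word can receive a different split in different documents.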
What about the M-step?
Well, you may recall the M-step is where we
take advantage of the inferred z values
to split the counts
and then collect the right counts
to re-estimate the parameters.
So in this case, we can re-estimate
our coverage probability.
And this is re-estimated based on
collecting all the words in the document.
And that's why we have the count
of the word in document.
And sum over all the words.
And then we're going to look at to
what extent this word belongs to
the topic theta sub j.
And this part is our guess from the E-step.
This tells us how likely this word
is actually from theta sub j.
And when we multiply them together,
we get the discounted count that's
allocated to topic theta sub j.
And when we normalize
this over all the topics,
we get the distribution of all
the topics to indicate the coverage.
And similarly, the bottom one is the
estimated probability of word for a topic.
And in this case we are using exactly
the same count; you can see this is
the same discounted count.
It tells us to what extent we should
allocate this word [INAUDIBLE], but
then the normalization is different.
Because in this case we are interested
in the word distribution, so
we simply normalize this
over all the words.
This is different; in contrast, here we
normalize among all the topics.
It would be useful to make
a comparison between the two.
These give us different distributions.
And these tell us how to
improve the parameters.
And as I just explained,
in both formulas we have a maximum
likelihood estimate based on the allocated
word counts [INAUDIBLE].
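Written out, the two steps described here can be sketched as follows. The notation is an assumption reconstructed from the spoken description rather than copied from the slide: c(w,d) is the count of word w in document d, p(z_{d,w}=j) is the E-step guess that w in d came from topic theta sub j (given that it is not from the background), p(z_{d,w}=B) is the guess that it came from the background model theta sub B, and lambda sub B is the fixed background proportion.

```latex
\begin{align*}
\text{E-step:}\quad
p(z_{d,w}=j) &= \frac{\pi_{d,j}\, p(w \mid \theta_j)}
                     {\sum_{j'=1}^{k} \pi_{d,j'}\, p(w \mid \theta_{j'})} \\
p(z_{d,w}=B) &= \frac{\lambda_B\, p(w \mid \theta_B)}
                     {\lambda_B\, p(w \mid \theta_B) + (1-\lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j)} \\[6pt]
\text{M-step:}\quad
\pi_{d,j} &= \frac{\sum_{w \in V} c(w,d)\,\bigl(1 - p(z_{d,w}=B)\bigr)\, p(z_{d,w}=j)}
                  {\sum_{j'=1}^{k} \sum_{w \in V} c(w,d)\,\bigl(1 - p(z_{d,w}=B)\bigr)\, p(z_{d,w}=j')} \\
p(w \mid \theta_j) &= \frac{\sum_{d \in C} c(w,d)\,\bigl(1 - p(z_{d,w}=B)\bigr)\, p(z_{d,w}=j)}
                           {\sum_{w' \in V} \sum_{d \in C} c(w',d)\,\bigl(1 - p(z_{d,w'}=B)\bigr)\, p(z_{d,w'}=j)}
\end{align*}
```

Both M-step numerators are the same allocated count; only the normalizer differs, over topics for the coverage and over words for the word distribution.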
Now this phenomenon is actually a general
phenomenon in all EM algorithms.
In the M-step, you generally
compute the expected count of
an event based on the E-step result,
and then you just
normalize the counts, typically.
So, in terms of computation
of this EM algorithm, we can
actually just keep counting various
events and then normalize them.
And when we think of it this way,
we also have a more concise way
of presenting the EM algorithm.
It actually helps us better
understand the formulas.
So I'm going to go over
this in some detail.
So as an algorithm, we first initialize
all the unknown parameters randomly,
all right.
So, in our case, we are interested in all
of those coverage parameters, the pi's, and
the word distributions [INAUDIBLE],
and we just randomly initialize them.
This is the initialization step and then
we will repeat until likelihood converges.
Now how do we know whether
likelihood converges?
We can compute
the likelihood at each step and
compare the current likelihood
with the previous likelihood.
If it doesn't change much,
we're going to say it has converged and stop, right.
So, in each step we're
going to do e-step and m-step.
In the E-step we're going to
augment the data by predicting
the hidden variables.
In this case,
the hidden variable, z sub d, w,
indicates whether the word w in
d is from a topic or background.
And if it's from a topic, which topic.
So if you look at the e-step formulas,
essentially we're actually
normalizing these counts, sorry,
these probabilities of observing
the word from each distribution.
So you can see,
basically the prediction that a word
is from topic theta sub j is
based on the probability of
selecting that theta sub j as the word
distribution to generate the word,
multiplied by the probability of observing
the word from that distribution.
And I said it's proportional to this
because in the implementation of the
EM algorithm you can keep a counter for
this quantity, and
in the end just normalize it.
So the normalization here
is over all the topics and
then you would get a probability.
Now, in the M-step, we do the same,
and we are going to collect these
allocated counts for each topic.
And we split words among the topics.
And then we're going to normalize
them in different ways to obtain
the re-estimates.
So for example, we can normalize among all
the topics to get the re-estimate of pi,
the coverage.
Or we can re-normalize
based on all the words.
And that would give us
a word distribution.
So it's useful to think of the algorithm in this
way because when implementing it, you can just
use variables to keep track of
these quantities in each case.
And then you just normalize these
variables to make them distributions.
Now I did not put the constraint for
this one.
And I intentionally leave
this as an exercise for you.
And you can see,
what's the normalizer for this one?
It's of a slightly different form but
it's essentially the same as
the one that you have
seen here in this one.
So in general, in the implementation of EM
algorithms, you will see that you accumulate
various counts and
then you normalize them.
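The "accumulate counts, then normalize" view can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the course's reference implementation: the function name `plsa_em`, the fixed background weight `lambda_b`, the convergence tolerance, and the toy count matrix are all hypothetical, and the background model is assumed to give every word a non-zero probability.

```python
import numpy as np

def plsa_em(counts, background, k, lambda_b=0.5, max_iter=200, tol=1e-6, seed=0):
    """counts: (n_docs, n_words) array of c(w, d).
    background: (n_words,) fixed background distribution, assumed all positive."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random initialization, normalized so that every row is a distribution.
    pi = rng.random((n_docs, k))
    pi /= pi.sum(axis=1, keepdims=True)
    theta = rng.random((k, n_words))
    theta /= theta.sum(axis=1, keepdims=True)

    history = []  # log-likelihood of the parameters at the start of each iteration
    for _ in range(max_iter):
        # E-step quantities: pi_{d,j} * p(w | theta_j), shape (n_docs, k, n_words).
        joint = pi[:, :, None] * theta[None, :, :]
        topic_mix = joint.sum(axis=1)
        p_word = lambda_b * background + (1 - lambda_b) * topic_mix
        history.append(float((counts * np.log(p_word)).sum()))
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break  # likelihood no longer changes much
        # Normalize over topics to get p(z = j); separately get p(z = background).
        p_z = joint / np.maximum(topic_mix, 1e-300)[:, None, :]
        p_bg = lambda_b * background / p_word
        # M-step: allocated counts, then two different normalizations.
        alloc = (counts * (1 - p_bg))[:, None, :] * p_z
        pi = alloc.sum(axis=2)
        pi /= pi.sum(axis=1, keepdims=True)        # normalize over topics
        theta = alloc.sum(axis=0)
        theta /= theta.sum(axis=1, keepdims=True)  # normalize over words
    return pi, theta, history

# Toy data: 4 documents over a 5-word vocabulary (made-up counts).
counts = np.array([[4, 3, 0, 0, 2],
                   [3, 5, 1, 0, 2],
                   [0, 0, 4, 5, 2],
                   [0, 1, 3, 4, 3]], dtype=float)
background = counts.sum(axis=0) / counts.sum()
pi, theta, history = plsa_em(counts, background, k=2)
```

The single array `alloc` plays the role of the counters mentioned in the lecture; summing it over words and over documents gives the two re-estimates.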
So to summarize,
we introduced the PLSA model.
Which is a mixture model with k unigram
language models representing k topics.
And we also added a pre-determined
background language model to
help discover discriminative topics,
because this background language model
can help attract the common terms.
And we use the maximum likelihood estimate
so that we can discover topical
knowledge from text data.
In this case PLSA allows us to discover
two things: one is k word distributions,
each one representing a topic and
the other is the proportion of
each topic in each document.
And such detailed characterization
of coverage of topics in documents
can enable a lot of further analysis.
For example, we can aggregate
the documents in a particular
time period to assess the coverage of
a particular topic in that time period.
That would allow us to generate
the temporal trends of topics.
We can also aggregate topics covered in
documents associated with a particular
author and then we can characterize
the topics written by this author, etc.
And in addition to this, we can also
cluster terms and cluster documents.
In fact,
each topic can be regarded as a cluster.
So we already have the term clusters.
The higher-probability
words can be regarded as
belonging to one cluster
represented by the topic.
Similarly, documents can be
clustered in the same way.
We can assign a document
to the topic cluster
that's covered most in the document.
So remember, pi's indicate to what extent
each topic is covered in the document,
we can assign the document to the topical
cluster that has the highest pi.
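Both kinds of clustering can be read directly off the estimated parameters. The sketch below assumes a hypothetical PLSA output (the `pi` and `theta` values and the tiny vocabulary are made up for illustration).

```python
import numpy as np

# Hypothetical PLSA output: rows of pi are documents, rows of theta are topics.
vocab = ["game", "team", "flight", "hotel", "star"]
pi = np.array([[0.8, 0.2],
               [0.3, 0.7],
               [0.6, 0.4]])
theta = np.array([[0.40, 0.35, 0.05, 0.05, 0.15],
                  [0.05, 0.05, 0.45, 0.40, 0.05]])

# Document clusters: assign each document to the topic with the highest pi.
doc_cluster = pi.argmax(axis=1)

# Term clusters: take the highest-probability words of each topic.
def top_words(theta_row, n=2):
    order = np.argsort(theta_row)[::-1][:n]
    return [vocab[i] for i in order]

term_clusters = [top_words(row) for row in theta]

print(doc_cluster.tolist())  # [0, 1, 0]
print(term_clusters)         # [['game', 'team'], ['flight', 'hotel']]
```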
And in general there are many useful
applications of this technique.
[MUSIC]

[MUSIC]
This lecture is about topic mining and
analysis.
We're going to talk about
using a term as topic.
This is a slide that you have
seen in an earlier lecture
where we define the task of
topic mining and analysis.
We also raised the question, how do
we exactly define the topic theta?
So in this lecture, we're going to
offer one way to define it, and
that's our initial idea.
Our idea here is defining
a topic simply as a term.
A term can be a word or a phrase.
And in general,
we can use these terms to describe topics.
So our first thought is just
to define a topic as one term.
For example, we might have terms
like sports, travel, or science,
as you see here.
Now if we define a topic in this way,
we can then analyze the coverage
of such topics in each document.
Here for example,
we might want to discover to what
extent document one covers sports.
And we found that 30% of the content
of document one is about sports.
And 12% is about travel, etc.
We might also discover document
two does not cover sports at all.
So the coverage is zero, etc.
So now, of course,
as we discussed in the task definition for
topic mining and analysis,
we have two tasks.
One is to discover the topics.
And the second is to analyze coverage.
So let's first think
about how we can discover
topics if we represent
each topic by a term.
So that means we need to mine k
topical terms from a collection.
Now there are, of course,
many different ways of doing that.
And we're going to talk about
a natural way of doing that,
which is also likely effective.
So first of all,
we're going to parse the text data in
the collection to obtain candidate terms.
Here candidate terms can be words or
phrases.
Let's say the simplest solution is
to just take each word as a term.
These words then become candidate topics.
Then we're going to design a scoring
function to measure how good each term
is as a topic.
So how can we design such a function?
Well there are many things
that we can consider.
For example, we can use pure statistics
to design such a scoring function.
Intuitively, we would like to
favor representative terms,
meaning terms that can represent
a lot of content in the collection.
So that would mean we want
to favor a frequent term.
However, if we simply use the frequency
to design the scoring function,
then the highest scored terms
would be general terms or
function words like "the", etc.
Those terms occur very frequently in English.
So we also want to avoid having
such words on the top, so
we want to penalize such words.
In general, we would like to favor
terms that are fairly frequent but
not too frequent.
So a particular approach could be based
on TF-IDF weighting from retrieval.
And TF stands for term frequency.
IDF stands for inverse document frequency.
We talked about some of these
ideas in the lectures about
the discovery of word associations.
So these are statistical methods,
meaning that the function is
defined mostly based on statistics.
So the scoring function
would be very general.
It can be applied to any language,
any text.
But when we apply such an approach
to a particular problem,
we might also be able to leverage
some domain-specific heuristics.
For example, in news, or actually in
general, we might want to favor title
words because the authors tend to
use the title to describe
the topic of an article.
If we're dealing with tweets,
we could also favor hashtags,
which are invented to denote topics.
So naturally, hashtags can be good
candidates for representing topics.
Anyway, after we have designed this
scoring function, we can discover
the k topical terms by simply picking
the k terms with the highest scores.
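One possible scoring function along these lines is sketched below. It is only an illustration under assumptions: terms are whitespace-separated lowercase words, the score is total term frequency times log(N / document frequency), and the four toy documents are made up. Other TF-IDF variants would work equally well.

```python
import math
from collections import Counter

def score_terms(docs):
    """Score each candidate word by total term frequency times IDF."""
    n_docs = len(docs)
    tf = Counter()  # total occurrences in the collection
    df = Counter()  # number of documents containing the word
    for doc in docs:
        words = doc.lower().split()
        tf.update(words)
        df.update(set(words))
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

def top_k_terms(docs, k):
    """Pick the k highest-scoring terms (ties broken alphabetically)."""
    scores = score_terms(docs)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

docs = [
    "the team won the game",
    "the flight to the hotel was late",
    "the team lost the game to the rival team",
    "the telescope found the star",
]
scores = score_terms(docs)
print(scores["the"])          # 0.0, because "the" occurs in every document
print(top_k_terms(docs, 1))   # ['team']
```

The word "the" is the most frequent but is pushed to a score of zero by the IDF factor, which is the penalty for overly common words described above.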
Now, of course,
we might encounter a situation where the
highest scored terms are all very similar.
They're semantically similar, or
closely related, or even synonyms.
So that's not desirable.
So we also want to have coverage over
all the content in the collection.
So we would like to remove redundancy.
And one way to do that is
to use a greedy algorithm,
which is sometimes called
maximal marginal relevance ranking.
Basically, the idea is to go down
the list based on our scoring
function and gradually take terms
to collect the k topical terms.
The first term, of course, will be picked.
When we pick the next term, we're
going to look at what terms have already
been picked and try to avoid
picking a term that's too similar.
So while we are considering
the ranking of a term in the list,
we are also considering
the redundancy of the candidate term
with respect to the terms
that we already picked.
And with some thresholding,
we can then strike a balance between
redundancy removal and
a high score for each term.
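The greedy procedure can be sketched as below. This is a simplified thresholding variant, not the full maximal marginal relevance formula; the `similarity` argument, the `max_sim` threshold, and the character-bigram similarity used in the example are all assumptions standing in for a real semantic similarity measure.

```python
def select_topical_terms(scored_terms, similarity, k, max_sim=0.5):
    """Greedily pick up to k high-scoring terms, skipping near-duplicates.

    scored_terms: dict mapping term -> score.
    similarity: function (term_a, term_b) -> value in [0, 1] (assumed helper).
    """
    picked = []
    # Go down the list in order of decreasing score.
    for term in sorted(scored_terms, key=lambda t: (-scored_terms[t], t)):
        # Keep the term only if it is not too similar to anything already picked.
        if all(similarity(term, other) <= max_sim for other in picked):
            picked.append(term)
        if len(picked) == k:
            break
    return picked

def bigram_jaccard(a, b):
    """Toy similarity: Jaccard overlap of character bigrams."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0

scores = {"sports": 0.9, "sport": 0.85, "travel": 0.8,
          "science": 0.7, "traveling": 0.6}
picked = select_topical_terms(scores, bigram_jaccard, k=3)
print(picked)  # ['sports', 'travel', 'science']
```

Here "sport" is skipped despite its high score because it is too similar to the already picked "sports".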
Okay, so
after this we will get k topical terms.
And those can be regarded as the topics
that we discovered from the collection.
Next, let's think about how we're going
to compute the topic coverage pi sub ij.
So looking at this picture,
we have sports, travel and science as
the topics.
And now suppose you are given a document.
How should we figure out the coverage
of each topic in the document?
Well, one approach can be to simply
count occurrences of these terms.
So for example, sports might have occurred
four times in this document and
travel occurred twice, etc.
And then we can just normalize these
counts as our estimate of the coverage
probability for each topic.
So in general, the formula would
be to collect the counts of
all the terms that represent the topics.
And then simply normalize them so
that the coverage of each
topic in the document would add to one.
This forms a distribution of the topics
for the document to characterize coverage
of different topics in the document.
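As a formula, the count-and-normalize estimate just described can be written as follows, where count(theta_j, d_i) is assumed to denote the number of times the term representing topic j occurs in document d_i:

```latex
\pi_{ij} = \frac{\mathrm{count}(\theta_j, d_i)}{\sum_{L=1}^{k} \mathrm{count}(\theta_L, d_i)},
\qquad \sum_{j=1}^{k} \pi_{ij} = 1 .
```

With the counts mentioned above, and assuming for illustration that science does not occur at all, sports would get 4/6, travel 2/6, and science 0.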
Now, as always,
when we think about an idea for
solving a problem, we have to ask
the question, how good is this one?
Or is this the best way
of solving the problem?
So now let's examine this approach.
In general,
we have to do some empirical evaluation
by using actual data sets
to see how well it works.
Well, in this case let's take
a look at a simple example here.
We have a text document that's
about an NBA basketball game.
So in terms of the content,
it's about sports.
But if we simply count these
words that represent our topics,
we will find that the word sports
actually did not occur in the article,
even though the content
is about sports.
So the count of sports is zero.
That means the coverage of sports
would be estimated as zero.
Now of course,
the term science also did not occur in
the document, and
its estimate is also zero.
And that's okay.
But sports certainly is not okay because
we know the content is about sports.
So this estimate has a problem.
What's worse, the term travel
actually occurred in the document.
So when we estimate the coverage
of the topic travel,
we have got a non-zero count.
So its estimated coverage
will be non-zero.
So this obviously is also not desirable.
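The failure is easy to reproduce. In the sketch below the sentence is a made-up stand-in for the article on the slide, and `term_coverage` is a hypothetical helper that counts each topic term and normalizes.

```python
from collections import Counter

def term_coverage(text, topic_terms):
    """Estimate topic coverage by counting each topic term and normalizing."""
    words = Counter(text.lower().split())
    counts = {t: words[t] for t in topic_terms}
    total = sum(counts.values())
    if total == 0:
        return {t: 0.0 for t in topic_terms}
    return {t: c / total for t, c in counts.items()}

# Made-up text about a basketball game; the word "sports" never occurs.
doc = ("the star scored thirty points in the basketball game "
       "before the team had to travel for the next game")
coverage = term_coverage(doc, ["sports", "travel", "science"])
print(coverage)  # {'sports': 0.0, 'travel': 1.0, 'science': 0.0}
```

All of the coverage goes to travel and none to sports, even though the text is clearly about a basketball game.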
So this simple example illustrates
some problems of this approach.
First, when we count what
words belong to the topic,
we also need to consider related words.
We can't simply just count
the topic word sports.
In this case, it did not occur at all.
But there are many related words
like basketball, game, etc.
So we need to count
the related words also.
The second problem is that a word
like star can actually be ambiguous.
So here it probably means
a basketball star, but
we can imagine it might also
mean a star in the sky.
So in that case, the star might actually
suggest, perhaps, a topic of science.
So we need to deal with that as well.
Finally, a main restriction of this
approach is that we have only one
term to describe the topic, so it cannot
really describe complicated topics.
For example, a very specialized
topic in sports would be harder to
describe by using just a word or
one phrase.
We need to use more words.
So this example illustrates
some general problems with
this approach of treating a term as a topic.
First, it lacks expressive power.
Meaning that it can only represent
the simple general topics, but
it cannot represent the complicated topics
that might require more words to describe.
Second, it's incomplete
in vocabulary coverage,
meaning that the topic itself
is only represented as one term.
It does not suggest what other
terms are related to the topic.
Even if we're talking about sports,
there are many terms that are related.
So it does not allow us to easily
count related terms in order to have them
contribute to the coverage of this topic.
Finally, there is this problem
of word sense disambiguation.
A topical term or
related term can be ambiguous.
For example,
basketball star versus star in the sky.
So in the next lecture,
we're going to talk
about how to solve
these problems with using a term as a topic.
[MUSIC]

This lecture is about Probabilistic Topic
Models for topic mining and analysis.
In this lecture,
we're going to continue talking
about the topic mining and analysis.
We're going to introduce
probabilistic topic models.
So this is a slide that
you have seen earlier,
where we discussed the problems
with using a term as a topic.
So, to solve these problems
intuitively we need to use
more words to describe the topic.
And this will address the problem
of lack of expressive power.
When we have more words that we
can use to describe the topic,
then we can describe complicated topics.
To address the second problem we
need to introduce weights on words.
This is what allows you to distinguish
subtle differences in topics, and
to introduce semantically
related words in a fuzzy manner.
Finally, to solve the problem of
word ambiguity, we need to split
an ambiguous word, so
that we can disambiguate its topic.
It turns out that all these can be done
by using a probabilistic topic model.
And that's why we're going to spend a lot
of lectures to talk about this topic.
So the basic idea here is to
improve the representation of
a topic by using a word distribution.
So what you see now is
the older representation,
where we represented each topic with just
one word, or one term, or one phrase.
But now we're going to use a word
distribution to describe the topic.
So here you see that for sports,
we're going to use
a word distribution over,
theoretically speaking, all
the words in our vocabulary.
So for example, the high
probability words here are sports,
game, basketball,
football, play, star, etc.
These are sports related terms.
And of course it would also give
a non-zero probability to some other word
like travel, which might be
related to sports in general, but
not so much related to the topic.
In general we can imagine a non-zero
probability for all the words.
And some words that are not related
would have very, very small probabilities.
And these probabilities will sum to one,
so that it forms a distribution
over all the words.
Now intuitively, this distribution
represents a topic in that if we sample
words from the distribution, we tend
to see words that are related to sports.
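A small sketch can make this concrete. The topic below is a hypothetical toy distribution over a six-word vocabulary; the probabilities are made up for illustration and are not the numbers on the slide.

```python
import math
import random

# Hypothetical toy "sports" topic; the probabilities must sum to one.
sports_topic = {
    "sports": 0.30, "game": 0.25, "basketball": 0.20,
    "football": 0.15, "star": 0.08, "travel": 0.02,
}
assert math.isclose(sum(sports_topic.values()), 1.0)

def sample_words(topic, n, seed=0):
    """Draw n words independently according to the topic's word distribution."""
    rng = random.Random(seed)
    words = list(topic)
    weights = [topic[w] for w in words]
    return rng.choices(words, weights=weights, k=n)

sample = sample_words(sports_topic, 1000)
# Sports-related words dominate the sample; "travel" shows up only rarely.
print(sample.count("sports"), sample.count("travel"))
```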
You can also see, as a very special case,
if the probability mass
is concentrated entirely on
just one word, say sports,
then this basically degenerates
to the simple representation
of a topic with just one word.
But as a distribution,
this topic representation can,
in general,
involve many words to describe a topic and
can model subtle differences
in the semantics of a topic.
Similarly we can model Travel and Science
with their respective distributions.
In the distribution for Travel we see top
words like attraction, trip, flight etc.
Whereas in Science we see scientist,
spaceship, telescope, or
genomics, and, you know,
science related terms.
Now that doesn't mean sports related terms
will necessarily have zero
probabilities for science.
In general we can imagine all of these
words will have non-zero probabilities.
It's just that for a particular
topic some words will have very,
very small probabilities.
Now you can also see there are some
words that are shared by these topics.
When I say shared, it just means that even
with some probability threshold,
you can still see one word
occurring in multiple topics.
In this case I mark them in black.
So you can see travel, for example,
occurred in all the three topics here, but
with different probabilities.
It has the highest probability for
the Travel topic, 0.05.
But with much smaller probabilities for
Sports and Science, which makes sense.
And similarly, you can see that star
also occurred in Sports and
Science with reasonably
high probabilities,
because it might actually be
related to the two topics.
So this representation addresses the
three problems that I mentioned earlier.
First, it now uses multiple
words to describe a topic.
So it allows us to describe
fairly complicated topics.
Second, it assigns weights to terms.
So now we can model subtle
differences of semantics.
And you can bring in related
words together to model a topic.
Third, because we have probabilities for
the same word in different topics,
we can disambiguate the sense of a word
in the text to decode
its underlying topic.
So we can address all three problems with
this new way of representing a topic.
So now of course our problem definition
has been refined just slightly.
The slide is very similar to what
you've seen before, except that we have
added a refinement for what our topic is.
Now each topic is a word distribution,
and for each word distribution we know
that all the probabilities should sum to
one over all the words in the vocabulary.
So you see a constraint here.
And we still have another constraint
on the topic coverage, namely the pi's.
So all the pi sub ij's must sum to one for
the same document.
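In symbols, the two constraints can be written as below, assuming V denotes the vocabulary, k the number of topics, and N the number of documents:

```latex
\sum_{w \in V} p(w \mid \theta_i) = 1 \quad \text{for each topic } i = 1, \dots, k,
\qquad
\sum_{j=1}^{k} \pi_{ij} = 1 \quad \text{for each document } i = 1, \dots, N .
```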
So how do we solve this problem?
Well, let's look at this problem
as a computation problem.
So we clearly specify its input and
output,
illustrated here on this slide.
Input of course is our text data.
C is our collection but we also generally
assume we know the number of topics, k.
Or we hypothesize a number and
then try to find k topics,
even though we don't know the exact
topics that exist in the collection.
And V is the vocabulary that has
a set of words that determines what
units would be treated as
the basic units for analysis.
In most cases we'll use words
as the basis for analysis.
And that means each word is a unit.
Now the output would consist of, first,
a set of topics represented by theta i's.
Each theta i is a word distribution.
And we also want to know the coverage
of topics in each document.
So those are
the same pi sub ij's
that we have seen before.
So given a set of text data, we would
like to compute all these distributions and
all these coverages, as you
have seen on this slide.
Now of course there may be many
different ways of solving this problem.
In theory, you can write the [INAUDIBLE]
program to solve this problem,
but here we're going to introduce
a general way of solving this
problem called a generative model.
And this is, in fact,
a very general idea and
it's a principled way of using statistical
modeling to solve text mining problems.
And here I dimmed the picture
that you have seen before
in order to show the generation process.
So the idea of this approach is actually
to first design a model for our data.
So we design a probabilistic model
to model how the data are generated.
Of course,
this is based on our assumption.
The actual data aren't
necessarily generated this way.
So that gives us a probability
distribution of the data
that you are seeing on this slide,
given a particular model and
parameters that are denoted by lambda.
So this lambda actually consists of
all the parameters that
we're interested in.
And these parameters in general
will control the behavior of
the probabilistic model,
meaning that if you set these
parameters to different values,
it will give some data points
higher probabilities than others.
Now in this case of course,
for our text mining problem or
more precisely topic mining problem,
we have the following parameters.
First of all we have the theta i's, each of
which is a word distribution, and then we have
a set of pi's for each document.
And since we have n documents, we have
n sets of pi's, and in each set
the pi values will sum to one.
So this is to say that we
first would pretend we already
have these word distributions and
the coverage numbers.
And then we can see how we can generate
data by using such distributions.
So how do we model the data in this way?
And we assume that the data
are actually samples
drawn from such a model that
depends on these parameters.
Now one interesting question here is to
think about how many
parameters are there in total?
Now obviously we can already see
n multiplied by k parameters
for the pi's.
We also see k theta i's.
But each theta i is actually a set
of probability values, right?
It's a distribution of words.
So I leave this as an exercise for
you to figure out exactly how
many parameters there are here.
Now once we set up the model then
we can fit the model to our data.
Meaning that we can
estimate the parameters or
infer the parameters based on the data.
In other words we would like to
adjust these parameter values
until we give our data set
the maximum probability.
I just said,
depending on the parameter values,
some data points will have higher
probabilities than others.
What we're interested in, here,
is what parameter values will give
our data set the highest probability?
So I also illustrate the problem
with a picture that you see here.
On the X axis I just illustrate lambda,
the parameters,
as a one dimensional variable.
It's an oversimplification, obviously,
but it suffices to show the idea.
And the Y axis shows the probability
of the observed data.
This probability obviously depends
on this setting of lambda.
So that's why it varies as you
change the value of lambda.
What we're interested in here
is to find the lambda star
that would maximize the probability
of the observed data.
So this would be, then,
our estimate of the parameters.
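Stated as a formula, and assuming the observed data is written simply as "data", the estimate is:

```latex
\lambda^{*} = \arg\max_{\lambda} \; p(\text{data} \mid \text{model}, \lambda)
```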
And these parameters,
note, are precisely what we
hoped to discover from text data.
So we'd treat these parameters
as actually the outcome or
the output of the data mining algorithm.
So this is the general idea of using
a generative model for text mining.
First, we design a model with
some parameter values to fit
the data as well as we can.
After we have fit the data,
we will recover some parameter values.
We will use those specific
parameter values, and
those would be the output
of the algorithm.
And we'll treat those as actually
the discovered knowledge from text data.
By varying the model of course we
can discover different knowledge.
So to summarize, we introduced
a new way of representing a topic,
namely representing it as a word distribution,
and this has the advantage of using
multiple words to describe a complicated
topic. It also allows us to assign
weights on words so we can model
subtle variations of semantics.
We talked about the task of topic mining
and analysis
when we define a topic as a distribution.
So the input is a collection of text
articles, a number of topics, and
a vocabulary set, and
the output is a set of topics,
each of which is a word distribution, and
also the coverage of all
the topics in each document.
And these are formally represented
by theta i's and pi sub ij's.
And we have two constraints here for
these parameters.
The first is the constraint
on the word distributions.
In each word distribution
the probability of all the words
must sum to 1,
all the words in the vocabulary.
The second constraint is on
the topic coverage in each document.
A document is not allowed to cover
a topic outside of the set of topics that
we are discovering.
So, the coverage of each of these k
topics would sum to one for a document.
We also introduce a general idea of using
a generative model for text mining.
And the idea here is, first we design
a model to model the generation of data.
We simply assume that the data
are generated in this way.
And inside the model we embed some
parameters that we're interested in
denoted by lambda.
And then we can infer the most
likely parameter values lambda star,
given a particular data set.
And we can then take the lambda star as
knowledge discovered from the text for
our problem.
And we can adjust
the design of the model and
the parameters to discover various
kinds of knowledge from text.
As you will see later
in the other lectures.
[MUSIC]

[SOUND]
This
lecture is about the overview
of statistical language models,
which cover probabilistic topic
models as special cases.
In this lecture we're going to give
an overview of statistical language models.
These models are general models that cover
probabilistic topic models
as special cases.
So first off,
what is a Statistical Language Model?
A Statistical Language Model is
basically a probability distribution
over word sequences.
So, for example,
we might have a distribution that gives,
today is Wednesday a probability of .001.
It might give today Wednesday is, which
is a non-grammatical sentence, a very,
very small probability as shown here.
And similarly another sentence,
the eigenvalue is positive might
get the probability of .00001.
So as you can see, such a distribution
clearly is context dependent.
It depends on the context of discussion.
Some word sequences might have higher
probabilities than others, but the same
sequence of words might have different
probabilities in different contexts.
And so this suggests that such a
distribution can actually characterize a topic.
Such a model can also be regarded
as a probabilistic mechanism for
generating text.
And that just means we can view text
data as data observed from such a model.
For this reason,
we call such a model a generative model.
So, now given a model we can then
sample sequences of words.
So, for example, based on the distribution
that I have shown here on this slide,
we might, say, sample
a sequence like today is Wednesday
because it has a relatively
high probability.
We might often get such a sequence.
We might also get the eigenvalue
is positive sometimes,
with a smaller probability, and
very, very occasionally we might
get today Wednesday is because
its probability is so small.
So in general, in order to characterize such
a distribution we must specify probability
values for
all these different sequences of words.
Obviously, it's impossible
to specify that because it's
impossible to enumerate all of
the possible sequences of words.
So in practice, we will have to
simplify the model in some way.
So, the simplest language model is
called the Unigram Language Model.
In such a case, we simply assume that the text
is generated by generating
each word independently.
But in general, the words may
not be generated independently.
But after we make this assumption, we can
significantly simplify the language model.
Basically, now the probability of
a sequence of words, w1 through wn,
will be just the product of
the probability of each word.
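In symbols, the independence assumption gives:

```latex
p(w_1 w_2 \cdots w_n) = \prod_{i=1}^{n} p(w_i)
```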
So for such a model,
we have as many parameters as
the number of words in our vocabulary.
So here we assume we have N words in the vocabulary,
so we have N probabilities,
one for each word.
And they sum to 1.
So, now we assume that
our text is a sample
drawn according to this word distribution.
That just means,
we're going to draw a word each time and
then eventually we'll get a text.
So for example, now again,
we can try to sample words
according to a distribution.
We might get Wednesday often or
today often.
And some other words like eigenvalue
might have a small probability, etcetera.
But with this, we actually can
also compute the probability of
every sequence, even though our model
only specifies the probabilities of words.
And this is because of the independence.
So specifically, we can compute
the probability of today is Wednesday.
Because it's just a product
of the probability of today,
the probability of is, and
probability of Wednesday.
For example,
I show some fake numbers here, and when you
multiply these numbers together you get
the probability of today is Wednesday.
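The computation is just a product of per-word probabilities. The numbers below are made up for illustration and are not the ones on the slide; `sequence_probability` is a hypothetical helper name.

```python
import math

# Hypothetical word probabilities from a unigram language model.
unigram = {"today": 0.0002, "is": 0.001, "wednesday": 0.000015}

def sequence_probability(words, model):
    """Probability of a word sequence under a unigram model: product of word probabilities."""
    return math.prod(model[w] for w in words)

p_normal = sequence_probability(["today", "is", "wednesday"], unigram)
p_shuffled = sequence_probability(["today", "wednesday", "is"], unigram)

# The product does not depend on the order in which the words appear.
print(math.isclose(p_normal, p_shuffled))  # True
```

Because the product is the same for any ordering, a unigram model gives "today Wednesday is" the same probability as "today is Wednesday".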
So as you can see, with N probabilities,
one for each word, we actually
can characterize the probability distribution
over all kinds of sequences of words.
And so, this is a very simple model.
It ignores the word order.
So it may not be adequate for some
problems, such as speech recognition,
where you may care about
the order of words.
But it turns out to be
quite sufficient for
many tasks that involve topic analysis.
And that's also what
we're interested in here.
So when we have a model, we generally have
two problems that we can think about.
One is, given a model, how likely are we
to observe a certain kind of data points?
That is,
we are interested in the sampling process.
The other is the estimation process.
And that is to estimate
the parameters of a model given
some observed data, and we're
going to talk about that in a moment.
Let's first talk about the sampling.
So, here I show two examples of word
distributions, or unigram language models.
The first one has higher probabilities for
words like text, mining, association,
etcetera.
Now this signals a topic about text mining
because when we sample words from
such a distribution, we tend to see words
that often occur in text mining contexts.
So in this case,
if we ask the question about
what is the probability of
generating a particular document.
Then, we likely will see text that
looks like a text mining paper.
Of course, the text that we
generate by drawing words from
this distribution is unlikely to be coherent.
Although, the probability
of generating a text mining
paper published
in a top conference is
non-zero, assuming that no word has
a zero probability in the distribution.
And that just means,
we can essentially generate all kinds of
text documents including very
meaningful text documents.
Now, the second distribution, shown
on the bottom, has different
words with high probabilities.
So food [INAUDIBLE] healthy [INAUDIBLE],
etcetera.
So this clearly indicates
a different topic.
In this case it's probably about health.
So if we sample words
from such a distribution,
then the probability of observing a text
mining paper would be very, very small.
On the other hand, the probability of
observing a text that looks like a food
nutrition paper would be
relatively higher.
So that just means, given a particular
distribution, different texts will have
different probabilities.
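This can be checked with two toy unigram models. The vocabularies and probabilities below are hypothetical stand-ins for the two distributions on the slide, and log probabilities are used to avoid multiplying many small numbers.

```python
import math

# Two hypothetical unigram language models over the same tiny vocabulary.
text_mining_lm = {"the": 0.50, "text": 0.20, "mining": 0.15,
                  "clustering": 0.10, "food": 0.03, "nutrition": 0.02}
health_lm = {"the": 0.50, "food": 0.25, "nutrition": 0.15,
             "text": 0.05, "mining": 0.03, "clustering": 0.02}

def log_likelihood(words, lm):
    """Log probability of a word sequence under a unigram language model."""
    return sum(math.log(lm[w]) for w in words)

mining_doc = "the text mining the clustering text".split()
food_doc = "the food nutrition the food".split()

# Each text is more probable under the model whose topic it matches.
mining_under_tm = log_likelihood(mining_doc, text_mining_lm)
mining_under_health = log_likelihood(mining_doc, health_lm)
food_under_tm = log_likelihood(food_doc, text_mining_lm)
food_under_health = log_likelihood(food_doc, health_lm)
print(mining_under_tm > mining_under_health, food_under_health > food_under_tm)
```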
Now let's look at
the estimation problem now.
In this case, we're going to assume
that we have observed the data.
We know exactly what
the text data looks like.
In this case,
let's assume we have a text mining paper.
In fact, it's the abstract of the paper,
so the total number of words is 100.
And I've shown some counts
of individual words here.
Now, if we ask the question,
what is the most likely
Language Model that has been
used to generate this text data?
Assuming that the text is observed
from some Language Model,
what's our best guess
of this Language Model?
Okay, so the problem now is just to
estimate the probabilities of these words.
As I've shown here.
So what do you think?
What would be your guess?
Would you guess text has
a very small probability, or
a relatively large probability?
What about query?
Well, your guess probably
would be dependent on
how many times we have observed
this word in the text data, right?
And if you think about it for a moment.
And if you are like many others,
you would have guessed that,
well, text has a probability of 10
out of 100 because I've observed
text 10 times in a text
that has a total of 100 words.
And similarly, mining has 5 out of 100.
And query has a relatively small
probability, observed just once.
So it's 1 out of 100.
Right, so that, intuitively,
is a reasonable guess.
But the question is, is this our best
guess or best estimate of the parameters?
Of course,
in order to answer this question,
we have to define what we mean by best.
In this case,
it turns out that our
guesses are indeed the best
in some sense, and this is called
the Maximum Likelihood Estimate.
It's the best in that it gives
the observed data the maximum probability.
Meaning that, if you change
the estimate somehow, even slightly,
then the probability of the observed
text data will be somewhat smaller.
And this is called
a Maximum Likelihood Estimate.
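The counting estimate described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `mle_unigram` and the filler token "other" are hypothetical, introduced so that the toy document has exactly 100 words with the counts mentioned in the lecture.

```python
from collections import Counter

def mle_unigram(words):
    """Maximum likelihood estimate of a unigram model: relative frequencies."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Hypothetical 100-word document: "text" x10, "mining" x5, "query" x1,
# and 84 filler tokens standing in for all the other words.
doc = ["text"] * 10 + ["mining"] * 5 + ["query"] + ["other"] * 84
theta = mle_unigram(doc)
print(theta["text"], theta["mining"], theta["query"])
```

Each estimated probability is simply the word's count divided by the document length, so the estimates sum to one by construction.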
[MUSIC]

[MUSIC]
So now let's talk about the problem
a little bit more, and specifically let's
talk about the two different ways
of estimating the parameters.
One is called the Maximum Likelihood
estimate that I already just mentioned.
The other is Bayesian estimation.
So in maximum likelihood estimation,
we define best as
meaning the data likelihood
has reached the maximum.
So formally it's given
by this expression here,
where we define the estimate as an arg
max of the probability of X given theta.
So, arg max here just means it's
actually a function that will return
the argument that gives the function
its maximum value, as the value.
So the value of arg max is not
the value of this function,
but rather the argument that
makes the function reach its maximum.
So in this case the value
of arg max is theta.
It's the theta that makes the probability
of X, given theta, reach its maximum.
So this estimate intuitively also
makes sense and it's often very useful,
and it seeks the parameters
that best explain the data.
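In symbols, the definition described here can be written as follows. The hat notation for the estimate is an assumption of this note; X is the observed data and theta the parameters, as in the lecture.

```latex
\hat{\theta} = \arg\max_{\theta} \; p(X \mid \theta)
```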
But it has a problem when the data
is too small, because when
there are very few data points,
the sample is small.
Then if we trust the data entirely and
try to fit the data,
we'll be biased.
So in the case of text data,
let's say our observed 100
words did not contain another
word related to text mining.
Now, our maximum likelihood estimator
will give that word a zero probability.
Because giving it a non-zero probability
would take away probability
mass from some observed word,
which obviously is not optimal in
terms of maximizing the likelihood of
the observed data.
But this zero probability for
all the unseen words may not
be reasonable sometimes.
Especially, if we want the distribution
to characterize the topic of text mining.
So one way to address this problem is
actually to use Bayesian estimation,
where we actually would look
at both the data and
our prior knowledge about the parameters.
We assume that we have some prior
belief about the parameters.
Now in this case of course, so we are not
going to look at just the data,
but also look at the prior.
So the prior here is
defined by P of theta, and
this means we will impose some
preference on certain thetas over others.
And by using Bayes Rule,
that I have shown here,
we can then combine
the likelihood function
with the prior to give us this
posterior probability of the parameter.
Now, a full explanation of Bayes rule,
and some of these things
related to Bayesian reasoning,
would be outside the scope of this course.
But I just gave a brief
introduction because this is
general knowledge that
might be useful to you.
The Bayes Rule is basically defined here,
and
allows us to write down one
conditional probability of X
given Y in terms of the conditional
probability of Y given X.
And you can see the two probabilities
are different in the order
of the two variables.
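For reference, the rule as described here, relating the probability of X given Y to the probability of Y given X, is usually written as:

```latex
p(X \mid Y) = \frac{p(Y \mid X)\, p(X)}{p(Y)}
```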
But often the rule is used for
making inferences
of the variable, so
let's take a look at it again.
We can assume that p(X) encodes
our prior belief about X.
That means before we observe any other
data, that's our belief about X,
where we believe some X values have
higher probability than others.
And this probability of X given Y
is a conditional probability, and
this is our posterior belief about X.
Because this is our belief about X
values after we have observed the Y.
Given that we have observed the Y,
now what do we believe about X?
Now, do we believe some values have
higher probabilities than others?
Now the two probabilities
are related through this one,
this can be regarded as the probability of
the observed evidence Y,
given a particular X.
So you can think about X
as our hypothesis, and
we have some prior belief about
which hypothesis to choose.
And after we have observed Y,
we will update our belief, and
this updating formula is based
on the combination of our prior
and the likelihood of observing
this Y if X is indeed true.
So much for the detour about Bayes Rule.
In our case, what we are interested
in is inferring the theta values.
So, we have a prior here that includes
our prior knowledge about the parameters.
And then we have the data likelihood here,
that would tell us which parameter
value can explain the data well.
The posterior probability
combines both of them,
so it represents a compromise
of the two preferences.
And in such a case, we can maximize
this posterior probability
to find the theta that would
maximize this posterior probability,
and this estimator is called a Maximum
a Posteriori, or MAP estimate.
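Written out, the posterior and the MAP estimate described here take the following form. The hat notation and the MAP subscript are assumptions of this note; the denominator p(X) does not depend on theta, which is why it can be dropped when maximizing.

```latex
p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)} \;\propto\; p(X \mid \theta)\, p(\theta),
\qquad
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \; p(X \mid \theta)\, p(\theta)
```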
And this estimator is
a more general estimator than
the maximum likelihood estimator.
Because if we define our prior
as a noninformative prior,
meaning that it's uniform
over all the theta values.
No preference, then we basically would go
back to the maximum likelihood estimate.
Because in such a case,
it's mainly going to be determined by
this likelihood value, the same as here.
But if we have some informative prior,
some bias towards
different values, then the MAP estimator
can allow us to incorporate that.
But the problem here of course,
is how to define the prior.
There is no free lunch and if you want to
solve the problem with more knowledge,
we have to have that knowledge.
And that knowledge,
ideally, should be reliable.
Otherwise, your estimate may not
necessarily be more accurate than that
maximum likelihood estimate.
So, now let's look at the Bayesian
estimation in more detail.
So, I show the theta values as just a one
dimensional value, and
that's a simplification of course.
And so, we're interested in which
value of theta is optimal.
So now, first we have the Prior.
The Prior tells us that
some of the values
are more likely than others, we believe.
For example, these values are more
likely than the values over here,
or here, or other places.
So this is our Prior, and
then we have our data likelihood.
And in this case, the data also tells us
which values of theta are more likely.
And that just means those values
can best explain our data.
And then when we combine the two
we get the posterior distribution,
and that's just a compromise of the two.
It would say that it's
somewhere in-between.
So, we can now look at some
interesting point estimates.
This point represents the mode of prior,
that means the most likely parameter
value according to our prior,
before we observe any data.
This point is the maximum
likelihood estimator,
it represents the theta that gives
the data the maximum probability.
Now this point is interesting,
it's the posterior mode.
It's the most likely value of theta
given by the posterior distribution.
And it represents a good
compromise of the prior mode and
the maximum likelihood estimate.
Now in general in Bayesian inference,
we are interested in
the distribution of all these
parameter values, as you see here.
It's a distribution over
theta values that you can see
here, P of theta given X.
So the problem of Bayesian inference is
to infer this posterior distribution, and
also to infer other interesting
quantities that might depend on theta.
So, I show f of theta here
as an interesting variable
that we want to compute.
But in order to compute this value,
we need to know the value of theta.
In Bayesian inference,
we treat theta as an uncertain variable.
So we think about all
the possible values of theta.
Therefore, we can estimate the value of
this function f as the expected value of f,
according to the posterior distribution
of theta, given the observed evidence X.
As a special case, we can assume f
of theta is just equal to theta.
In this case,
we get the expected value of the theta,
that's basically the posterior mean.
That gives us also one point of theta, and
it's sometimes the same as posterior mode,
but it's not always the same.
So, it gives us another way
to estimate the parameter.
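The four point estimates discussed above (prior mode, maximum likelihood estimate, posterior mode, posterior mean) can be compared numerically on a grid. This is a minimal sketch under assumptions that are not from the lecture: theta is the probability of a single word, the data are k occurrences in n tokens, and the prior shape is an arbitrary illustrative choice peaked at 0.3.

```python
# Grid approximation for a one-dimensional theta (hypothetical example).
n, k = 10, 1                                  # assumed data: 1 occurrence in 10 tokens
grid = [i / 1000 for i in range(1, 1000)]     # candidate theta values in (0, 1)

def unnormalized_prior(t):
    # Illustrative prior peaked at theta = 0.3 (an assumption, not from the lecture)
    return t ** 3 * (1 - t) ** 7

def likelihood(t):
    # Probability of the assumed data for a given theta (up to a constant factor)
    return t ** k * (1 - t) ** (n - k)

def argmax_on_grid(values):
    # Return the grid point at which `values` is largest
    return grid[max(range(len(grid)), key=lambda i: values[i])]

prior = [unnormalized_prior(t) for t in grid]
like = [likelihood(t) for t in grid]
post = [p * l for p, l in zip(prior, like)]   # posterior is proportional to prior * likelihood
z = sum(post)
post = [p / z for p in post]                  # normalize over the grid

prior_mode = argmax_on_grid(prior)
mle = argmax_on_grid(like)
post_mode = argmax_on_grid(post)              # MAP estimate
post_mean = sum(t * p for t, p in zip(grid, post))
print(prior_mode, mle, post_mode, post_mean)
```

In this toy setting the posterior mode falls between the maximum likelihood estimate and the prior mode, and the posterior mean is close to, but not the same as, the posterior mode.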
So, this is a general illustration of
Bayesian estimation and inference.
And later,
you will see this can be useful for
topic mining where we want to inject
some prior knowledge about the topics.
So to summarize,
we've introduced the language model,
which is basically a probability
distribution over text.
It's also called a generative model for
text data.
The simplest language model
is Unigram Language Model,
it's basically a word distribution.
We introduced the concept
of likelihood function,
which is the probability of
the data given some model.
And this function is very important,
given a particular set of parameter
values this function can tell us which X,
which data point has a higher likelihood,
higher probability.
Given a data sample X,
we can use this function to determine
which parameter values would maximize
the probability of the observed data,
and this is the maximum
likelihood estimate.
We also talked about Bayesian
estimation or inference.
In this case, we must define a prior
on the parameters, p of theta.
And then we're interested in computing the
posterior distribution of the parameters,
which is proportional to the product of
the prior and the likelihood.
And this distribution would allow us then
to infer any value derived from theta.
[MUSIC]

[SOUND] This lecture is a continued
discussion of probabilistic topic models.
In this lecture, we're going to continue
discussing probabilistic models.
We're going to talk about
a very simple case where we
are interested in just mining
one topic from one document.
So in this simple setup,
we are interested in analyzing
one document and
trying to discover just one topic.
So this is the simplest
case of topic model.
The input now no longer has k,
which is the number of topics because we
know there is only one topic and the
collection has only one document, also.
In the output,
we also no longer have coverage because
we assumed that the document
covers this topic 100%.
So the main goal is just to discover
the word probabilities for
this single topic, as shown here.
As always, when we think about using a
generative model to solve such a problem,
we start with thinking about what
kind of data we are going to model or
from what perspective we're going to
model the data or data representation.
And then we're going to
design a specific model for
the generation of the data,
from our perspective.
Where our perspective just means we want
to take a particular angle of looking at
the data, so that the model will
have the right parameters for
discovering the knowledge that we want.
And then we'll be thinking
about the likelihood function, or
write down the likelihood function, to
capture more formally how likely
a data point will be
obtained from this model.
And the likelihood function will have
some parameters in the function.
And then we are interested in
estimating those parameters, for example,
by maximizing the likelihood, which will
lead to the maximum likelihood estimator.
These estimated parameters
will then become the output
of the mining algorithm,
which means we'll take the estimated
parameters as the knowledge
that we discover from the text.
So let's look at these steps for
this very simple case.
Later we'll look at this procedure for
some more complicated cases.
So our data, in this case is, just
a document which is a sequence of words.
Each word here is denoted by x sub i.
Our model is a Unigram language model.
A word distribution that we hope to
denote a topic and that's our goal.
So we will have as many parameters as
there are words in our vocabulary, in this case M.
And for convenience we're
going to use theta sub i to
denote the probability of word w sub i.
And obviously these theta
sub i's will sum to 1.
Now what does a likelihood
function look like?
Well, this is just the probability
of generating this whole document
given such a model.
Because we assume independence in
generating each word, the probability of
the document will be just a product
of the probability of each word.
And since some words might
have repeated occurrences,
we can also rewrite this
product in a different form.
So in this line, we have rewritten
the formula into a product
over all the unique words in
the vocabulary, w sub 1 through w sub M.
Now this is different
from the previous line,
where the product is over different
positions of words in the document.
Now when we do this transformation,
we then would need to
introduce a count function here.
This denotes the count of
word one in the document, and
similarly this is the count
of word M in the document,
because these words might
have repeated occurrences.
You can also see that if a word did
not occur in the document,
it will have a zero count, and therefore
the corresponding term will disappear.
So this is a very useful form of
writing down the likelihood function
that we will often use later.
So I want you to pay attention to this,
just get familiar with this notation.
It just changes the product to be over all
the different words in the vocabulary.
So in the end, of course, we'll use
theta sub i to express this likelihood
function and it would look like this.
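The slide is not reproduced in the transcript, so the following is a reconstruction from the spoken description: x_i is the word at position i, w_1 through w_M are the vocabulary words, c(w_i, d) is the count of w_i in document d, and theta_i is the probability of w_i. Writing the document length as |d| is an assumption of this note.

```latex
p(d \mid \theta)
  = \prod_{i=1}^{|d|} p(x_i \mid \theta)
  = \prod_{i=1}^{M} p(w_i \mid \theta)^{c(w_i, d)}
  = \prod_{i=1}^{M} \theta_i^{\,c(w_i, d)},
\qquad \sum_{i=1}^{M} \theta_i = 1
```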
Next, we're going to find
the theta values or probabilities
of these words that would maximize
this likelihood function.
So now lets take a look at the maximum
likelihood estimate problem more closely.
This line is copied from
the previous slide.
It's just our likelihood function.
So our goal is to maximize
this likelihood function.
We will find it often easier to
maximize the log-likelihood
instead of the original likelihood.
And this is purely for
mathematical convenience, because after
the logarithm transformation our function
will become a sum instead of a product.
And we also have constraints
over these probabilities.
The sum makes it easier to take
derivative, which is often needed for
finding the optimal
solution of this function.
So please take a look at this sum again,
here.
And this is a form of
a function that you will often
see later also in
the more general topic models.
So it's a sum over all
the words in the vocabulary.
And inside the sum there is
a count of a word in the document.
And this is multiplied by
the logarithm of a probability.
So let's see how we can
solve this problem.
Now at this point the problem is purely a
mathematical problem because we are going
to just find the optimal solution
of a constrained maximization problem.
The objective function is
the likelihood function and
the constraint is that all these
probabilities must sum to one.
So, one way to solve the problem is
to use the Lagrange multiplier approach.
Now this content is beyond
the scope of this course, but
since the Lagrange multiplier is a very
useful approach, I also would like
to just give a brief introduction to this,
for those of you who are interested.
So in this approach we will
construct a Lagrange function, here.
And this function will combine
our objective function
with another term that
encodes our constraint and
we introduce a Lagrange multiplier here,
lambda, so it's an additional parameter.
Now, the idea of this approach is just to
turn the constrained optimization into,
in some sense,
an unconstrained optimization problem.
Now we are just interested in
optimizing this Lagrange function.
As you may recall from calculus,
an optimal point
would be achieved when
the derivative is set to zero.
This is a necessary condition.
It's not sufficient, though.
So if we do that you will
see the partial derivative
with respect to theta i
here is equal to this.
And this part comes from the derivative
of the logarithm function and
this lambda is simply taken from here.
And when we set it to zero we can
easily see theta sub i is
related to lambda in this way.
Since we know all the theta
i's must sum to one,
we can plug this into this constraint
here.
And this will allow us to solve for
lambda.
And this is just the negative
sum of all the counts.
And this further allows us to then
solve the optimization problem,
eventually, to find the optimal
setting for theta sub i.
And if you look at this formula it turns
out that it's actually very intuitive
because this is just the normalized
count of these words by the document length,
which is also the sum of all
the counts of words in the document.
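The derivation described in this passage can be reconstructed as follows. The slide itself is not in the transcript, so the sign of the constraint term is an assumption, chosen so that lambda comes out as the negative sum of the counts, as stated in the lecture; the name f for the Lagrange function and |d| for the document length are also assumptions.

```latex
\begin{aligned}
\log p(d \mid \theta) &= \sum_{i=1}^{M} c(w_i, d) \log \theta_i \\
f(\theta, \lambda) &= \sum_{i=1}^{M} c(w_i, d) \log \theta_i
  + \lambda \Big( \sum_{i=1}^{M} \theta_i - 1 \Big) \\
\frac{\partial f}{\partial \theta_i} &= \frac{c(w_i, d)}{\theta_i} + \lambda = 0
  \;\Rightarrow\; \theta_i = -\frac{c(w_i, d)}{\lambda} \\
\sum_{i=1}^{M} \theta_i = 1 &\;\Rightarrow\; \lambda = -\sum_{i=1}^{M} c(w_i, d) \\
\hat{\theta}_i &= \frac{c(w_i, d)}{\sum_{j=1}^{M} c(w_j, d)} = \frac{c(w_i, d)}{|d|}
\end{aligned}
```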
So, after all this math,
we have just obtained something
that's very intuitive, and
this agrees with our
intuition that we want to
maximize the likelihood of the data by
assigning as much probability
mass as possible to all
the observed words here.
And you might also notice that this is
the general result of the maximum
likelihood estimator.
In general, the estimator would be to
normalize counts and it's just sometimes
the counts have to be done in a particular
way, as you will also see later.
So this is basically an analytical
solution to our optimization problem.
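As a quick numerical sanity check of this analytical solution, one can compare the log-likelihood at the normalized counts with the log-likelihood at a slightly perturbed distribution. This is only a sketch: the word counts and the function name `log_likelihood` are hypothetical choices for illustration.

```python
import math

def log_likelihood(counts, theta):
    """Sum over words of count * log(probability)."""
    return sum(c * math.log(theta[w]) for w, c in counts.items())

# Hypothetical counts for a 100-word document
counts = {"text": 10, "mining": 5, "query": 1, "other": 84}
n = sum(counts.values())

# Analytical solution: normalized counts
theta_ml = {w: c / n for w, c in counts.items()}

# Perturbed alternative: move 0.01 of probability mass from "text" to "query".
# It still sums to one, so it satisfies the constraint.
theta_alt = dict(theta_ml)
theta_alt["text"] -= 0.01
theta_alt["query"] += 0.01

print(log_likelihood(counts, theta_ml), log_likelihood(counts, theta_alt))
```

The perturbed distribution gives a lower log-likelihood, which is what we expect if the normalized counts are the maximizer.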
In general though, when the likelihood
function is very complicated, we're not
going to be able to solve the optimization
problem by having a closed form formula.
Instead we have to use some
numerical algorithms and
we're going to see such cases later, also.
So if you imagine what would we
get if we use such a maximum
likelihood estimator to estimate one
topic for a single document d here?
Let's imagine this document
is a text mining paper.
Now, what you might see is
something that looks like this.
On the top, you will see the high
probability words tend to be those very
common words,
often functional words in English.
And this will be followed by
some content words that really
characterize the topic well like text,
mining, etc.
And then in the end,
you also see small probabilities for
words that are not really
related to the topic but
they might be extraneously
mentioned in the document.
As a topic representation,
you will see this is not ideal, right?
That's because the high probability
words are functional words;
they are not really
characterizing the topic.
So my question is how can we
get rid of such common words?
Now this is the topic of the next module.
We're going to talk about how to use
probabilistic models to somehow get rid of
these common words.
[MUSIC]

[MUSIC]
This lecture is about the mixture
of unigram language models.
In this lecture we will continue
discussing probabilistic topic models.
In particular, we will introduce
a mixture of unigram language models.
This is a slide that
you have seen earlier,
where we talked about how to
get rid of the background
words that we have on top for
one document.
So if you want to solve the problem,
it would be useful to think about
why we end up having this problem.
Well, this is obviously because these
words are very frequent in our data and
we are using maximum
likelihood to estimate.
Then the estimate obviously would
have to assign high probability to
these words in order to
maximize the likelihood.
So, in order to get rid of them, that
would mean we'd have to do something
differently here.
In particular we'll have
to say this distribution
doesn't have to explain all
the words in the text data.
What we're going to say is that
these common words should not be
explained by this distribution.
So one natural way to solve the problem is
to think about using another distribution
to account for just these common words.
This way, the two distributions can be
mixed together to generate the text data.
And we'll let the other model, which
we'll call the background topic model,
generate the common words.
This way our target topic theta
here will be only generating
the content words that
characterize the content of the document.
So, how does this work?
Well, it is just a small
modification of the previous setup
where we have just one distribution.
Since we now have two distributions,
we have to decide which distribution
to use when we generate the word.
Each word will still be a sample
from one of the two distributions.
Text data is still
generated the same way.
Namely, we generate
one word at a time and
eventually we generate a lot of words.
When we generate the word,
however, we're going to first decide
which of the two distributions to use.
And this is controlled by another
probability, the probability of
theta sub d and
the probability of theta sub B here.
So this is the probability of selecting
the topic word distribution.
This is the probability of
selecting the background word
distribution denoted by theta sub B.
In this case I just give an example
where we can set both to 0.5.
So you're going to basically flip a coin,
a fair coin,
to decide what you want to use.
But in general these probabilities
don't have to be equal.
So you might bias toward using
one topic more than the other.
So now the process of generating a word
would be to first flip a coin
based on these probabilities of choosing
each model. If, let's say, the coin
shows up as heads, which means we're going
to use the topic word distribution,
then we're going to use this word
distribution to generate a word.
Otherwise we might be
going through this path,
and we're going to use the background
word distribution to generate a word.
So in such a case,
we have a model that has some uncertainty
associated with the use
of a word distribution.
But we can still think of this as
a model for generating text data.
And such a model is
called a mixture model.
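The two-step generation process just described (flip a coin to choose a distribution, then draw a word from it) can be sketched as below. The function names and the two toy word distributions are hypothetical; the probabilities are illustrative values, not the ones on the lecture slide.

```python
import random

def sample_word(theta_d, theta_b, p_topic, rng):
    """Draw one word: first pick a component, then sample a word from it."""
    dist = theta_d if rng.random() < p_topic else theta_b
    words = list(dist)
    weights = [dist[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

def generate_document(theta_d, theta_b, p_topic, length, seed=0):
    """Generate `length` words independently from the two-component mixture."""
    rng = random.Random(seed)
    return [sample_word(theta_d, theta_b, p_topic, rng) for _ in range(length)]

# Hypothetical topic and background word distributions
theta_d = {"text": 0.5, "mining": 0.4, "the": 0.1}
theta_b = {"the": 0.7, "is": 0.2, "text": 0.1}

doc = generate_document(theta_d, theta_b, p_topic=0.5, length=20)
print(doc)
```

Setting `p_topic` to 0.5 corresponds to the fair coin in the example; other values bias the choice toward one of the two distributions.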
So now let's see.
In this case, what's the probability
of observing a word w?
Now here I showed some words,
like "the" and "text".
So as in all cases,
once we setup a model we are interested
in computing the likelihood function.
The basic question is, so
what's the probability of
observing a specific word here?
Now we know that the word can be observed
from each of the two distributions, so
we have to consider two cases.
Therefore it's a sum over these two cases.
The first case is to use the topic
word distribution to generate the word.
And in such a case then
the probability would be
the probability of theta sub d,
which is the probability
of choosing the model,
multiplied by the probability of actually
observing the word from that model.
Both events must happen
in order to observe the word.
We first must have chosen
the topic theta sub d and then
we also have to actually have sampled
the word "the" from the distribution.
And similarly,
the second part accounts for
a different way of generating
the word from the background.
Now obviously the probability of
"text" is similar, right?
So we also can see the two
ways of generating the text.
And in each case, it's a product of the
probability of choosing a particular
distribution
multiplied by the probability of
observing the word from that distribution.
Now, as you will see,
this is actually a general form.
So you might want to make sure that you
have really understood this expression here.
And you should convince yourself that
this is indeed the probability of
observing "text".
So to summarize what we observed here.
The probability of a word from
a mixture model is a general sum
of different ways of generating the word.
In each case,
it's a product of the probability
of selecting that component model
multiplied by the probability of
actually observing the data point
from that component model.
And this is something quite general and
you will see this occurring often later.
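The sum-of-products form summarized above can be written as a small function. This is a sketch under assumptions: the function name `mixture_word_prob` and the numeric values of the two distributions are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def mixture_word_prob(word, components):
    """Probability of a word under a mixture model.

    `components` is a list of (selection_probability, word_distribution) pairs;
    the result sums, over components, the probability of selecting the component
    times the probability of the word under that component.
    """
    return sum(p_sel * dist.get(word, 0.0) for p_sel, dist in components)

# Hypothetical two-component mixture with equal selection probabilities
theta_d = {"text": 0.6, "the": 0.4}
theta_b = {"the": 0.9, "text": 0.1}
components = [(0.5, theta_d), (0.5, theta_b)]

p_the = mixture_word_prob("the", components)    # 0.5 * 0.4 + 0.5 * 0.9
p_text = mixture_word_prob("text", components)  # 0.5 * 0.6 + 0.5 * 0.1
print(p_the, p_text)
```

Because the function takes a list of components, the same code covers mixtures with more than two distributions.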
So the basic idea of a mixture
model is just to treat
these two distributions
together as one model.
So I used a box to bring all
these components together.
So if you view this
whole box as one model,
it's just like any other generative model.
It would just give us
the probability of a word.
But the way that determines this
probability is quite different from
when we have just one distribution.
And this is basically a more
complicated model,
more complicated
than just one distribution.
And it's called a mixture model.
So as I just said we can treat
this as a generative model.
And it's often useful to think of
it just as a likelihood function.
The illustration that
you have seen before,
which is dimmer now, is just
the illustration of this generative model.
So mathematically,
this model is nothing but
to just define the following
generative model.
Where the probability of a word is
assumed to be a sum over two cases
of generating the word.
And the form you are seeing now
is a more general form than
what you have seen in
the calculation earlier.
Well, I just use the symbol
w to denote any word, but
you can still see this is
basically first a sum.
Right?
And this sum is due to the fact that the
word can be generated in multiple ways,
two ways in this case.
And inside the sum,
each term is a product of two terms.
And the two terms are first
the probability of selecting a component
like theta sub d, and second,
the probability of actually observing
the word from this component of the model.
So this is a very general description
of all the mixture models.
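In the notation used in this lecture, the two-component form described here reads as follows (a reconstruction from the spoken description, since the slide is not part of the transcript):

```latex
p(w) = p(\theta_d)\, p(w \mid \theta_d) + p(\theta_B)\, p(w \mid \theta_B),
\qquad p(\theta_d) + p(\theta_B) = 1
```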
I just want to make sure
that you understand
this because this is really the basis for
understanding all kinds of topic models.
So now once we set up the model,
we can write down the likelihood
function as we see here.
The next question is,
how can we estimate the parameters,
or what to do with the parameters,
given the data.
Well, in general,
we can use some of the text data
to estimate the model parameters.
And this estimation would allow us to
discover the interesting
knowledge about the text.
So, in this case, what do we discover?
Well, these are represented
by our parameters, and
we will have two kinds of parameters.
One is the two word distributions,
which represent topics, and
the other is the coverage
of each topic.
And this is determined by
the probability of theta sub d and
the probability of theta sub B,
which sum to one.
Now, what's interesting is
also to think about special
cases, like when we set one of
them to one, what would happen?
Well, the other would be zero, right?
And if you look at
the likelihood function,
it will then degenerate to the special
case of just one distribution.
Okay so you can easily verify that by
assuming one of these two is 1.0 and
the other is zero.
So in this sense,
the mixture model is more general than
the previous model where we
have just one distribution.
It can cover that as a special case.
So to summarize, we talked about the
mixture of two unigram language models, and
the data we're considering
here is just one document.
And the model is a mixture
model with two components,
two unigram language models,
specifically theta sub d,
which is intended to denote the topic of
document d, and theta sub B, which is
representing a background topic that
we can set to attract the common
words because common words would be
assigned a high probability in this model.
So the parameters can
be collectively called
Lambda, which I show here. You can again
think about the question of how many
parameters we are talking about exactly.
This is usually a good exercise to do
because it allows you to see the model in
depth and to have a complete understanding
of what's going on in this model.
And we have mixing weights,
of course, also.
So what does a likelihood
function look like?
Well, it looks very similar
to what we had before.
So for the document,
first it's a product over all the words in
the document exactly the same as before.
The only difference is that inside here
now it's a sum instead of just one.
So you might have recalled before
we just had this one there.
But now we have this sum
because of the mixture model.
And because of the mixture model we
also have to introduce a probability of
choosing that particular
component of distribution.
And so
this is just another way of writing it,
by using a product over all the unique
words in our vocabulary instead of
having that product over all
the positions in the document.
And this form where we look at
the different unique words is
a convenient form for computing
the maximum likelihood estimate later.
And the maximum likelihood estimator is,
as usual,
just to find the parameters that would
maximize the likelihood function.
And the constraints here
are of course of two kinds.
One is that word probabilities in each
[INAUDIBLE] must sum to 1; the other is that
the choice of each
[INAUDIBLE] must sum to 1.
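Putting the pieces of this summary together, the likelihood and the estimation problem can be reconstructed as below. This is one reading of the spoken description rather than a copy of the slide: it assumes the first constraint refers to the word probabilities within each of the two distributions and the second to the two selection probabilities, and it writes the document length as |d|.

```latex
\begin{aligned}
p(d \mid \Lambda)
  &= \prod_{i=1}^{|d|} \big[ p(\theta_d)\, p(x_i \mid \theta_d)
     + p(\theta_B)\, p(x_i \mid \theta_B) \big] \\
  &= \prod_{i=1}^{M} \big[ p(\theta_d)\, p(w_i \mid \theta_d)
     + p(\theta_B)\, p(w_i \mid \theta_B) \big]^{c(w_i, d)} \\
\Lambda^{*} &= \arg\max_{\Lambda} \; p(d \mid \Lambda) \\
\text{subject to}\quad
  & \sum_{i=1}^{M} p(w_i \mid \theta_d) = \sum_{i=1}^{M} p(w_i \mid \theta_B) = 1,
  \qquad p(\theta_d) + p(\theta_B) = 1
\end{aligned}
```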
[MUSIC]

[SOUND] This lecture is about
mixture model estimation.
In this lecture we're going to continue
discussing probabilistic topic models.
In particular,
we're going to talk about how to estimate
the parameters of a mixture model.
So let's first look at our motivation for
using a mixture model.
And we hope to factor out
the background words
from the topic word distribution.
The idea is to assume that the text data
actually contains two kinds of words.
One kind is from the background here.
So, "the", "is", "we", etc.
And the other kind is from our topic word
distribution that we are interested in.
So in order to solve this problem
of factoring out background words,
we can set up our mixture model as follows.
We're going to assume that we
already know
the values for
all the parameters in the mixture model,
except for the word distribution
of theta sub d, which is our target.
So this is a case of customizing
a probabilistic model so
that we embed an unknown variable
that we are interested in.
But we're going to simplify other things.
We're going to assume we
have knowledge about the others.
And this is a powerful way
of customizing a model
for a particular need.
Now you can imagine,
we could have assumed that we also
don't know the background words.
But in this case,
our goal is to factor out precisely
those high probability background words.
So we assume the background
model is already fixed.
And one problem here is how
can we adjust theta sub d
in order to maximize the probability
of the observed document here,
where we assume all the other
parameters are known.
Now although we designed
the model heuristically
to try to factor out
these background words,
it's unclear whether,
if we use the maximum likelihood estimator,
we will actually end up having
a word distribution where the common
words like "the" would indeed have
smaller probabilities than before.
Now in this case it turns
out the answer is yes.
And when we set up
the probability in this way,
when we use the maximum likelihood estimator,
we will end up having a word distribution
where these common words
would be factored out
by the use of the background
word distribution.
So to understand why this is so,
it's useful to examine
the behavior of a mixture model.
So we're going to look at
a very very simple case.
In order to understand some interesting
behaviors of a mixture model.
The observed patterns here actually are
generalizable to mixture models in general.
But it's much easier to
understand this behavior
when we use a very simple case
like what we are seeing here.
So specifically in this case,
let's assume that
the probability of choosing each of
the two models is exactly the same.
So we're going to flip a fair coin
to decide which model to use.
Furthermore, we're going
to assume there are
precisely two words, "the" and "text".
Obviously this is a very naive
oversimplification of the actual text,
but again, it's useful to examine
the behavior in such a special case.
So we further assume that the background
model gives a probability of
0.9 to the word "the" and 0.1 to "text".
Now, let's also assume that our data is
extremely simple: the document has just
two words, "text" and "the". So now let's write
down the likelihood function in such a case.
First, what's the probability of text,
and what's the probability of the?
I hope by this point you'll
be able to write it down.
So the probability of text is
basically the sum over two cases,
where each case corresponds to
one of the word distributions,
and it accounts for
the two ways of generating text.
And inside each case, we have
the probability of choosing the model,
which is 0.5 multiplied by the probability
of observing text from that model.
Similarly, the,
would have a probability of the same form,
just what is different is
the exact probabilities.
So naturally our likelihood function
is just a product of the two.
So it's very easy to see that,
once you understand what's
the probability of each word.
Which is also why it's so
important to understand what's exactly
the probability of observing each
word from such a mixture model.
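
To make the two-word likelihood concrete, here is a minimal Python sketch of what was just described. The names `P_BACKGROUND`, `LAMBDA_B`, `word_prob`, and `likelihood` are made up for illustration, and the candidate values for theta sub d are arbitrary; only the 0.9/0.1 background and the fair coin come from the lecture.

```python
# Assumed setup from the lecture: two words, a fair coin between the models.
P_BACKGROUND = {"the": 0.9, "text": 0.1}  # fixed background model, theta_B
LAMBDA_B = 0.5                            # probability of choosing the background


def word_prob(word, p_topic):
    """Mixture probability of one word: sum over the two ways to generate it."""
    return LAMBDA_B * P_BACKGROUND[word] + (1 - LAMBDA_B) * p_topic[word]


def likelihood(doc, p_topic):
    """Product of the mixture probabilities of all words in the document."""
    result = 1.0
    for word in doc:
        result *= word_prob(word, p_topic)
    return result


# An arbitrary candidate for theta_d (hypothetical values that sum to one).
candidate = {"the": 0.5, "text": 0.5}
print(likelihood(["text", "the"], candidate))
```

Changing the values in `candidate` (while keeping them summing to one) changes the likelihood, and that is exactly the quantity we want to maximize.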
Now, the interesting question now is,
how can we then optimize this likelihood?
Well, you will note that
there are only two variables.
They are precisely the two
probabilities of the two words.
text and the, given by theta sub d.
And this is because we have assumed
that all the other parameters are known.
So, now the question is a very
simple algebra question.
So, we have a simple expression
with two variables and
we hope to choose the values of these
two variables to maximize this function.
This is like the exercises we have
seen in simple algebra problems.
Note that the two probabilities must
sum to one, so there's some constraint.
If there were no constraint of course,
we would set both probabilities to
their maximum value which would be one,
to maximize the function. But we can't do that
because text and the must sum to one.
We can't give both a probability of one.
So, now the question is how should
we allocate the probability
mass between the two words?
What do you think?
Now, it would be useful to look
at this formula for a moment, and
to see, intuitively,
what we should do in order to
set these probabilities to
maximize the value of this function.
Okay, if we look into this further,
then we see some interesting behavior
of the two component models in that
they will be collaborating to maximize
the probability of the observed data.
Which is dictated by the maximum
likelihood estimator.
But they are also competing in some way,
and in particular,
they would be competing on the words.
And they would tend to bet high
probabilities on different words
to avoid this competition in some sense or
to gain advantages in this competition.
So again,
looking at this objective function and
we have a constraint on
the two probabilities.
Now, if you look at
the formula intuitively,
you might feel that you want to set the
probability of text to be somewhat larger.
And this intuition can be well supported
by a mathematical fact, which is that when
the sum of two variables is
a constant, the product of them
is maximized when they are equal,
and this is a fact we know from algebra.
Now if we plug that in, it
would mean that we have to make the
probabilities of the two words under the mixture equal.
And when we make them equal and
then consider the constraint, it
will be easy to solve this problem, and
the solution is that the probability of text
will be 0.9 and the probability of the is 0.1.
The probability of text is now much
larger than probability of the, and
this is not the case when we
have just one distribution.
And this is clearly because of
the use of the background model,
which assigns a very high probability
to the and a low probability to text.
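
As a quick numerical check of this solution, the sketch below does a simple grid search over the probability of text under theta sub d. The function name `likelihood` and the grid resolution are assumptions made for illustration; the 0.5 mixing weight and the 0.9/0.1 background probabilities are the ones assumed in the lecture.

```python
def likelihood(p_text):
    """Likelihood of the document 'text the' as a function of p(text | theta_d)."""
    p_the = 1.0 - p_text  # the two probabilities must sum to one
    mix_text = 0.5 * 0.1 + 0.5 * p_text  # background part is weak for "text"
    mix_the = 0.5 * 0.9 + 0.5 * p_the    # background part is strong for "the"
    return mix_text * mix_the


# Try p(text | theta_d) = 0.000, 0.001, ..., 1.000 and keep the best value.
grid = [i / 1000 for i in range(1001)]
best = max(grid, key=likelihood)
print("best p(text | theta_d):", best)
print("likelihood at the best point:", likelihood(best))
```

The search lands at a probability of about 0.9 for text, which is where the two mixture probabilities are both 0.5, consistent with the algebra argument.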
And if you look at the equation
you will see obviously
some interaction of the two
distributions here.
In particular,
you will see that, in order to make them equal,
the probability assigned
by theta sub d must be higher for
a word that has a smaller
probability given by the background.
This is obvious from
examining this equation.
Because the background part is weak for
text.
It's small.
So in order to compensate for that,
we must make the probability for
text given by theta sub D somewhat larger,
so that the two sides can be balanced.
So this is in fact a very
general behavior of this model.
And that is, if one distribution assigns a
higher probability to one word than another,
then the other distribution
would tend to do the opposite.
Basically it would discourage other
distributions from doing the same.
And this is to balance them out so
we can account for all kinds of words.
And this also means that by using
a background model that is fixed to
assign high probabilities
to background words,
we can indeed encourage the unknown
topic word distribution to assign smaller
probabilities to such common words,
and instead put more probability
mass on the content words
that cannot be explained well
by the background model,
meaning that they have a very small
probability from the background model, like
text here.
[MUSIC]

[SOUND] Now let's look at another
behavior of the mixture model, and
in this case let's look at
the response to data frequencies.
So what you are seeing now is basically
the likelihood function for
the two-word document, and
we know that in this case the solution is that text
has a probability of 0.9 and
the has a probability of 0.1.
Now it's interesting to
think about a scenario where we start
adding more words to the document.
So what would happen if we add
many the's to the document?
Now this would change the game, right?
So, how?
Well, picture, what would
the likelihood function look like now?
Well, we start with the likelihood
function for the two words, right?
As we add more words, we know that
we just have to multiply
the likelihood function by
additional terms to account for
the additional
occurrences of the.
Since in this case,
all the additional terms are the,
we're going to just multiply by this term.
Right?
For the probability of the.
And if we have another occurrence of the,
we'd multiply again by the same term,
and so on and so forth.
Add as many terms as the number of
the's that we add to the document, d'.
Now this obviously changes
the likelihood function.
So what's interesting is now to think
about how would that change our solution?
So what's the optimal solution now?
Now, intuitively you'd know
the original solution,
0.9 versus 0.1, will no
longer be optimal for this new function.
Right?
But the question is how
should we change it.
In general, the probabilities have to sum to one.
So we know we must take away some
probability mass from one word and
add the probability
mass to the other word.
The question is which word should
have its probability reduced and
which word should have a larger probability.
And in particular,
let's think about the probability of the.
Should it be increased
to be more than 0.1?
Or should we decrease it to less than 0.1?
What do you think?
Now you might want to pause the video
a moment to think more about
this question,
because this has to do with understanding
an important behavior of a mixture model
and, indeed,
of any maximum likelihood estimator.
Now if you look at the formula for
a moment, then you will see it seems like
now the objective function is more
influenced by the than text.
Before, each contributed one term.
So now as you can imagine,
it would make sense to actually
assign a smaller probability to
text
to make room for
a larger probability for the.
Why?
Because the is repeated many times.
If we increase it a little bit,
it will have more positive impact.
Whereas a slight decrease of text
will have relatively small impact
because it occurred just once, right?
So this means there is another
behavior that we observe here.
That is, high-frequency words
tend to get high probabilities
from all the distributions.
And, this is no surprise at all,
because after all, we are maximizing
the likelihood of the data.
So the more a word occurs, then it
makes more sense to give such a word
a higher probability because the impact
would be more on the likelihood function.
This is in fact a very general phenomenon
of all maximum likelihood estimators.
But in this case, we can see as we
see more occurrences of a term,
it also encourages the unknown
distribution theta sub d
to assign a somewhat higher
probability to this word.
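
To see this response to data frequencies numerically, here is a small Python sketch that keeps one occurrence of text, adds more and more occurrences of the, and searches for the best probability of the under theta sub d. The helper names `log_likelihood` and `best_p_the`, the grid search, and the particular counts tried are all assumptions for illustration, not part of the lecture.

```python
import math

P_BG = {"the": 0.9, "text": 0.1}  # fixed background model from the lecture
LAMBDA_B = 0.5                    # fair coin between background and topic model


def log_likelihood(p_the, n_the, n_text=1):
    """Log-likelihood of a document with n_text 'text' and n_the 'the'."""
    mix_the = LAMBDA_B * P_BG["the"] + (1 - LAMBDA_B) * p_the
    mix_text = LAMBDA_B * P_BG["text"] + (1 - LAMBDA_B) * (1 - p_the)
    # Each extra occurrence of "the" multiplies the likelihood by mix_the,
    # which adds one more log(mix_the) term here.
    return n_the * math.log(mix_the) + n_text * math.log(mix_text)


def best_p_the(n_the, steps=1000):
    """Grid search for the p(the | theta_d) that maximizes the likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda p: log_likelihood(p, n_the))


for n_the in (1, 2, 5, 10):
    print(n_the, "occurrence(s) of 'the' -> best p(the | theta_d):", best_p_the(n_the))
```

With a single the, the search recovers the earlier solution of about 0.1, and the best value grows as more occurrences of the are added.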
Now it's also interesting to think about
the impact of the probability of theta sub B,
the probability of choosing one
of the two component models.
Now we've been so far assuming
that each model is equally likely.
And that gives us 0.5.
But you can again look at this likelihood
function and try to picture what would
happen if we increase the probability
of choosing a background model.
Now you will see that for these terms for the,
we would have a different form where
the probability would be
even larger, because the background has
a high probability for the word and
the coefficient in front of 0.9, which
is now 0.5, would be even larger.
When this is larger,
the overall result would be larger.
And that also makes it
less important for
theta sub d to increase
the probability of the,
because it's already very large.
So the impact here of increasing
the probability of the is somewhat
regulated by this coefficient,
the 0.5.
If it's larger on the background,
then it becomes less important
to increase the value.
So this means the behavior here,
which is that high-frequency words tend to get
high probabilities, is affected or
regularized somewhat by the probability
of choosing each component.
The more likely a component
is to be chosen,
the more important it is for it to have higher
values for these frequent words.
If a component has a very small probability of
being chosen, then the incentive is less.
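
The regulating effect of the mixing weight can also be checked numerically. The sketch below fixes a document with one text and five occurrences of the (an arbitrary choice for illustration) and varies the probability of choosing the background model. The function name `best_p_the` and the set of weights tried are assumptions.

```python
import math

P_BG = {"the": 0.9, "text": 0.1}  # fixed background model from the lecture


def best_p_the(lambda_b, n_the=5, n_text=1, steps=1000):
    """Best p(the | theta_d) when the background is chosen with probability lambda_b."""
    def log_likelihood(p_the):
        mix_the = lambda_b * P_BG["the"] + (1 - lambda_b) * p_the
        mix_text = lambda_b * P_BG["text"] + (1 - lambda_b) * (1 - p_the)
        return n_the * math.log(mix_the) + n_text * math.log(mix_text)

    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=log_likelihood)


for lambda_b in (0.3, 0.5, 0.7, 0.9):
    print("lambda_B =", lambda_b, "-> best p(the | theta_d):", best_p_the(lambda_b))
```

As the background becomes more likely to be chosen, it already explains the occurrences of the well, so the best probability of the under theta sub d goes down.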
So to summarize,
we have just discussed the mixture model.
And we discussed the estimation
problem of the mixture model. In
particular, we discussed some
general behavior of the estimator, and
that means we can expect our
estimator to capture these intuitions.
First, every component model
attempts to assign high probabilities to
high-frequency words in the data.
And this is to collaboratively
maximize likelihood.
Second, different component models tend to
bet high probabilities on different words.
And this is to avoid a competition or
waste of probability.
And this would allow them to collaborate
more efficiently to maximize
the likelihood.
Third, the probability of choosing each
component regulates the collaboration and
the competition between component models.
It would allow some component models
to respond more to the change,
for example, of the frequency of
a data point in the data.
We also talked about the special case
of fixing one component to a background
word distribution, right?
And this distribution can be estimated
by using a collection of documents,
a large collection of English documents,
by using just one distribution and
then we'll just have normalized
frequencies of terms to
give us the probabilities
of all these words.
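
Estimating such a background distribution from normalized frequencies can be sketched in a few lines of Python. The function name `estimate_background`, the whitespace tokenization, and the two toy documents are assumptions for illustration; a real background model would use a large collection.

```python
from collections import Counter


def estimate_background(documents):
    """Estimate one word distribution from normalized term frequencies."""
    counts = Counter()
    for doc in documents:
        # Naive tokenization: lowercase and split on whitespace (an assumption).
        counts.update(doc.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# A toy stand-in for a large collection of English documents.
collection = ["the cat sat on the mat", "the dog read the text"]
background = estimate_background(collection)
print(background["the"], background["text"])
```

Even in this tiny example the common word the gets a much higher probability than text, which is the property the mixture model relies on.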
Now when we use such
a specialized mixture model,
we showed that we can effectively get rid
of the background words in the other component.
And that would make the discovered
topic more discriminative.
This is also an example of imposing
a prior on the model parameters, and
the prior here basically means one model
must be exactly the same as the background
language model. If you recall what we
talked about in Bayesian estimation,
this prior will allow us to favor a model
that is consistent with our prior.
In fact, if it's not consistent, we're
going to say the model is impossible.
So it has a zero prior probability.
That effectively excludes such a scenario.
This is also an issue that
we'll talk more about later.
[MUSIC]














[SOUND] This lecture is about
opinion mining and sentiment analysis,
covering motivation.
In this lecture,
we're going to start talking about
mining a different kind of knowledge,
namely, knowledge about the observer or
humans that have generated the text data.
In particular, we're going to talk about
opinion mining and sentiment analysis.
As we discussed earlier, text data
can be regarded as data generated
from humans as subjective sensors.
In contrast, we have other devices, such
as video recorders, that can record what's
happening in the real world objectively to
generate video data, for example.
Now the main difference between text
data and other data, like video data,
is that it has rich opinions,
and the content tends to be subjective
because it's generated by humans.
Now, this is actually a unique advantage
of text data, as compared with other data,
because this offers a great
opportunity to understand the observers.
We can mine text data to
understand their opinions.
Understand people's preferences,
how people think about something.
So this lecture and the following lectures
will be mainly about how we can mine and
analyze opinions buried
in a lot of text data.
So let's start with
the concept of opinion.
It's not that easy to
formally define opinion, but
mostly we would define
opinion as a subjective
statement describing what a person
believes or thinks about something.
Now, I highlighted quite a few words here.
And that's because it's worth thinking
a little bit more about these words.
And that will help us better
understand what's in an opinion.
And this further helps us to
define opinion more formally,
which is always needed to computationally
solve the problem of opinion mining.
So let's first look at the key
word "subjective" here.
This is in contrast with an objective
statement or factual statement.
Those statements can be proved right or
wrong.
And this is a key differentiating
factor from opinions,
which tend to be not
easy to prove wrong or
right, because they reflect what
the person thinks about something.
So in contrast, an objective statement can
usually be proved wrong or correct.
For example, you might say this
computer has a screen and a battery.
Now that's something you can check.
It either has a battery or not.
But in contrast with this, think about
a sentence such as, this laptop has
the best battery or
this laptop has a nice screen.
Now these statements
are more subjective and
it's very hard to prove
whether it's wrong or correct.
So an opinion is a subjective statement.
And next let's look at
the keyword "person" here.
And that indicates that
there is an opinion holder.
Because when we talk about opinion,
it's about an opinion held by someone.
And then we notice that
there is something here.
So that is the target of the opinion.
The opinion is expressed
on this something.
And now, of course, believes or
thinks implies that
an opinion will depend on the culture or
background and the context in general.
Because a person might think
differently in a different context.
People from different backgrounds
may also think in different ways.
So this analysis shows that there are
multiple elements that we need to include
in order to characterize opinion.
So, what's a basic opinion
representation like?
Well, it should include at
least three elements, right?
Firstly, it has to specify
what's the opinion holder.
So whose opinion is this?
Second, it must also specify the target,
what's this opinion about?
And third, of course,
we want the opinion content.
So what exactly is the opinion?
If we can identify these,
we get a basic understanding of the opinion,
which can already be useful sometimes.
If we want to understand further,
we want an enriched opinion representation.
And that means we also want to
understand, for example,
the context of the opinion:
in what situation was the opinion expressed?
For example, what time was it expressed?
We would also like to understand
the opinion sentiment, and this is
to understand what the opinion tells
us about the opinion holder's feeling.
For example, is this opinion positive
or negative?
Or perhaps the opinion holder was happy or
was sad.
So such understanding obviously
goes beyond just extracting
the opinion content;
it needs some analysis.
So let's take a simple
example of a product review.
In this case, we have an explicit
opinion holder and an explicit target.
So it's obvious what the opinion holder is,
and that's just the reviewer, and it's also often
very clear what the opinion target is,
and that's the product reviewed, for
example iPhone 6.
When the review is posted, usually
you can extract such information easily.
Now the content, of course,
is a review text that's, in general,
also easy to obtain.
So you can see product reviews are fairly
easy to analyze in terms of obtaining
a basic opinion representation.
But of course, if you want to get more
information, you might want to know the context,
for example.
The review was written in 2015.
Or, we want to know that the sentiment
of this review is positive.
So, this additional understanding of
course adds value to mining the opinions.
Now, you can see in this case the task
is relatively easy and that's
because the opinion holder and the opinion
target have already been identified.
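
The basic and enriched opinion representations described above can be sketched as a small data structure. The class name `Opinion`, its field names, and the placeholder review text are assumptions made for illustration; the lecture only names the elements (holder, target, content, context, sentiment).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Opinion:
    """One opinion representation, following the elements named in the lecture."""
    holder: str                      # whose opinion is this?
    target: str                      # what is the opinion about?
    content: str                     # what exactly is the opinion?
    context: Optional[str] = None    # enriched: time, place, or situation
    sentiment: Optional[str] = None  # enriched: e.g. positive or negative


# The product review example: holder and target are already identified.
review = Opinion(
    holder="reviewer",
    target="iPhone 6",
    content="<review text goes here>",  # placeholder, not real data
    context="written in 2015",
    sentiment="positive",
)
print(review)
```

For a product review, only the sentiment field really needs analysis; the other fields usually come for free from the review page.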
Now let's take a look at
a sentence in the news.
In this case, we have an implicit
holder and an implicit target.
And the task is in general harder.
So, we can identify opinion holder here,
and that's the governor of Connecticut.
We can also identify the target.
So one target is Hurricane Sandy, but
there is also another target
mentioned which is hurricane of 1938.
So what's the opinion?
Well, there's a negative sentiment here
that's indicated by words like bad and
worst.
And we can also, then, identify context,
New England in this case.
Now, unlike in the product review,
all these elements must be extracted by
using natural language processing techniques.
So, the task is much harder,
and we need deeper natural
language processing.
And these examples also
suggest that a lot of work can
easily be done for product reviews.
That's indeed what has happened.
Analyzing sentiment in
news is still quite difficult;
it's more difficult than the analysis
of opinions in product reviews.
Now there are also some other
interesting variations.
In fact, here we're going to
examine the variations of opinions,
more systematically.
First, let's think about
the opinion holder.
The holder could be an individual or
it could be a group of people.
Sometimes, the opinion
was from a committee,
or from a whole country of people.
The opinion target can also vary a lot.
It can be about one entity,
a particular person, a particular product,
a particular policy, etc.
But it could be about a group of products.
It could be about the products
from a company in general.
It could also be very specific,
about one attribute,
an attribute of the entity.
For example,
it's just about the battery of iPhone.
It could be someone else's opinion.
And one person might comment on
another person's opinion, etc.
So, you can see there is a lot of
variation here that will cause
the problem to vary a lot.
Now, opinion content, of course,
can also vary a lot. On the surface,
you can identify a one-sentence opinion or
a one-phrase opinion.
But you can also have longer
text to express an opinion,
like a whole article.
And furthermore, we can identify
variation in the sentiment or
emotion dimension, which is about
the feeling of the opinion holder.
So, we can distinguish positive
versus negative or neutral, or
happy versus sad, et cetera.
Finally, the opinion
context can also vary.
We can have a simple context, like
different time or different locations.
But there could be also complex contexts,
such as some background
of topic being discussed.
So when an opinion is expressed in a
particular discourse context, it has to
be interpreted in different ways than
when it's expressed in another context.
So the context can be very [INAUDIBLE] to
entire discourse context of the opinion.
From a computational perspective,
we're mostly interested in what opinions
can be extracted from text data.
So, it turns out that we can
also distinguish
different kinds of opinions in text
data from a computational perspective.
First, the observer might make
a comment about an opinion target
in the observed world.
So in this case we have the author's opinion.
For example,
I don't like this phone at all.
And that's an opinion of this author.
In contrast, the text might also
report the opinions of others.
So the person could also make an observation
about another person's opinion and
report this opinion.
So for example,
I believe he loves the painting.
And that opinion
is really expressed by another person here.
So, it doesn't mean this
author loves that painting.
So clearly, the two kinds of opinions
need to be analyzed in different ways.
And sometimes in product reviews
you can see this: although mostly the opinions
are from the reviewer,
sometimes a reviewer might mention
opinions of his friend or her friend.
Another complication is that
there may be indirect opinions or
inferred opinions that can be obtained
by making inferences on
what's expressed in the text, which might
not necessarily look like an opinion.
For example, one statement might be,
this phone ran out of
battery in just one hour.
Now, this is in a way a factual statement
because it's either true or false, right?
You can even verify that,
but from this statement,
one can also infer some negative opinions
about the quality of the battery of
this phone, or the feeling of
the opinion holder about the battery.
The opinion holder clearly wished
that the battery would last longer.
So these are interesting variations
that we need to pay attention to when we
extract opinions.
Also, for
this reason about indirect opinions,
it's often also very useful to extract
whatever the person has said about
the product, and sometimes factual
sentences like these are also very useful.
So, from a practical viewpoint,
sometimes we don't necessarily
extract just the subjective sentences.
Instead, all the sentences that
are about the opinions are useful for
understanding the person or
understanding the product that they comment on.
So the task of opinion mining can be
defined as taking text data as input
to generate a set of
opinion representations.
In each representation we should
identify the opinion holder,
target, content, and context.
Ideally we can also infer the opinion
sentiment from the content and
the context to better understand
the opinion.
Now often, some elements of
the representation are already known.
I just gave a good example in
the case of product reviews,
where the opinion holder and the opinion
target are often explicitly identified.
And that's why this turns out to be
one of the simplest opinion mining tasks.
Now, it's interesting to think about
the other tasks that might be also simple.
Because those are the cases
where you can easily build
applications by using
opinion mining techniques.
So now that we have talked about what is
opinion mining, we have defined the task.
Let's also just talk a little bit about
why opinion mining is very important and
why it's very useful.
So here, I identify three major reasons,
three broad reasons.
The first is it can help decision support.
It can help us optimize our decisions.
We often look at other people's opinions,
or read the reviews,
in order to make decisions like
buying a product or using a service.
We also would be interested
in others' opinions
when we decide whom to vote for, for example.
And policy makers
may also want to know people's
opinions when designing a new policy.
So that's one general
kind of application.
And it's very broad, of course.
The second application is to understand
people, and this is also very important.
For example, it could help
understand people's preferences.
And this could help us
better serve people.
For example, we can optimize a product search
engine or optimize a recommender system
if we know what people are interested in,
what people think about products.
It can also help with advertising,
of course, and we can have targeted
advertising if we know what kind of
people tend to like what kind of product.
Now the third kind of application
can be called voluntary survey.
Now this is mostly important research
that used to be done by doing surveys,
doing manual surveys,
with questions and answers.
People need to fill in forms
to answer the questions.
Now this is directly related to humans
as sensors, and we can usually aggregate
opinions from a lot of humans to
kind of assess the general opinion.
Now this would be very useful for
business intelligence where manufacturers
want to know where their products
have advantages over others.
What are the winning
features of their products, and the
winning features of competing products?
Market research has to do with
understanding consumers' opinions,
and this can be very useful for
that.
Data-driven social science research
can benefit from this because researchers can
do text mining to understand
people's opinions.
And if you can aggregate a lot of opinions
from social media, from a lot of public
information, then you can actually
do some study of some questions.
For example, we can study the behavior of
people on social media on social networks.
And these can be regarded as voluntary
survey done by those people.
In general, we can gain a lot of advantage
in any prediction task because we can
leverage the text data as
extra data about any problem.
And so we can use text based
prediction techniques to help you
make predictions or
improve the accuracy of prediction.
[MUSIC]

[SOUND]
This lecture is about using a time series
as context to potentially
discover causal topics in text.
In this lecture, we're going to continue
discussing Contextual Text Mining.
In particular, we're going to look
at the time series as a context for
analyzing text,
to potentially discover causal topics.
As usual, let's start with the motivation.
In this case, we hope to use text
mining to understand a time series.
Here, what you are seeing is Dow Jones
Industrial Average stock price curves.
And you'll see a sudden drop here.
Right.
So one would be interested in knowing
what might have caused the stock
market to crash.
Well, if you know the background,
you might be able to figure it out if you
look at the time stamp, or there are other
data that can help us think about it.
But the question here is can
we get some clues about this
from the companion news stream?
And we have a lot of news data
that was generated during that period.
So if you do that, we might
actually discover that the crash
happened
at the time of the September 11 attack.
And that's the time when there
is a sudden rise of the topic
about September 11
in news articles.
Here's another scenario where we want
to analyze the Presidential Election.
And this is the time series that is from
the Presidential Prediction Market.
For example, the Iowa Electronic Markets
would have stocks for each candidate.
And if you believe one candidate will
win, then you tend to buy the stock for
that candidate, causing the price
of that candidate to increase.
So, that's a nice way to actually do
a survey of people's opinions about
these candidates.
Now, suppose you see a sudden
drop of price for one candidate.
And you might also want to know what
might have caused the sudden drop.
Or in a social science study, you might
be interested in knowing what mattered
in this election,
what issues really matter to people.
Now again in this case,
we can look at the companion news
stream and ask the question:
are there any clues in the news stream
that might provide insight about this?
So for example,
we might discover the mention of tax cut
has been increasing since that point.
So maybe,
that's related to the drop of the price.
So all these cases are special
cases of a general problem of joint
analysis of text and a time series
data to discover causal topics.
The input in this case is time series plus
text data that are produced in the same
time period, the companion text stream.
And this is different from
the standard topic models,
where we have just a text collection.
That's why we see time series here,
it serves as context.
Now, the output that we
want to generate is the topics
whose coverage in the text stream has
strong correlations with the time series.
For example, whenever the topic is
mentioned, the price tends to go down, etc.
Now we call these topics Causal Topics.
Of course, they're not,
strictly speaking, causal topics.
We are never going to be able to
verify whether they are causal, or
there's a true causal relationship here.
That's why we put causal
in quotation marks.
But at least they are correlating
topics that might potentially
explain the cause and
humans can certainly further analyze such
topics to understand the issue better.
And the output would contain topics
just like in topic modeling.
But we hope that these topics are not
just the regular topics.
These topics certainly don't have to
explain the text data the best, but
rather they have to explain
the data in the text,
meaning that they have to represent
meaningful topics in text
semantically. But also, more importantly,
they should be correlated with the external
time series that's given as a context.
So to understand how we solve this
problem, let's first just try to
solve the problem with a regular
topic model, for example PLSA.
And we can apply this to the text stream, and
with some extension like CPLSA, or
Contextual PLSA,
we can discover these
topics in the collection and
also discover their coverage over time.
So, one simple solution is,
to choose the topics from
this set that have the strongest
correlation with the external time series.
But this approach is not
going to be very good.
Why?
Because
we are restricted to the topics
that are discovered by PLSA or LDA.
And that means the choice of
topics will be very limited.
And we know these models try to maximize
the likelihood of the text data.
So those topics tend to be the major
topics that explain the text data well.
And they are not necessarily
correlated with the time series.
Even if we get the best one, the most
correlated topics might still not be so
interesting from causal perspective.
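
The simple baseline just described, ranking already discovered topics by their correlation with the external time series, can be sketched as follows. Pearson correlation is used here only as a stand-in measure, and the function names `pearson` and `rank_topics`, the topic names, and all the numbers are made-up toy values for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (0.0 if one is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)


def rank_topics(coverage, series):
    """Sort topics by the strength (absolute value) of their correlation."""
    scores = {topic: pearson(cov, series) for topic, cov in coverage.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy data: a price series and the coverage of three topics over five time points.
price = [10, 9, 5, 4, 6]
coverage = {
    "topic1": [0.1, 0.2, 0.6, 0.7, 0.5],  # rises when the price falls
    "topic2": [0.3, 0.3, 0.3, 0.3, 0.3],  # flat coverage, no correlation
    "topic3": [0.2, 0.5, 0.1, 0.4, 0.3],
}
for topic, score in rank_topics(coverage, price):
    print(topic, round(score, 3))
```

This only reorders topics that the topic model already found, which is exactly the limitation pointed out above.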
So in this work cited here,
a better approach was proposed.
And this approach is called
Iterative Causal Topic Modeling.
The idea is to do an iterative
adjustment of the topics
discovered by topic models, using
the time series to induce a prior.
So here's an illustration of
how this works.
Take the text stream as input and
then apply regular topic modeling
to generate a number of topics.
Let's say four topics.
Shown here.
And then we're going to use
external time series to assess
which topic is more causally related or
correlated with the external time series.
So we have something that ranks them.
And we might think that topic one and
topic four are more correlated, and
topic two and topic three are not.
Now we could have stopped here, and
that would be just like the simple
approach that I talked about earlier:
we can then get these topics and
call them causal topics.
But as I also explained, these
topics are unlikely to be very good
because they are general topics that
explain the whole text collection.
They are not necessarily
the best topics correlated
with our time series.
So what we can do in this approach
is to first zoom into the word level, and
we can look into each word in
the top-ranked word list for each topic.
Let's say we take Topic 1
as the target to examine.
We know Topic 1 is correlated
with the time series.
Or is at least the best that we could
get from this set of topics so far.
And we're going to look at the words
in this topic, the top words.
And if the topic is correlated
with the Time Series,
there must be some words that are highly
correlated with the Time Series.
So here, for example,
we might discover W1 and W3 are positively
correlated with Time Series, but
W2 and W4 are negatively correlated.
So, as a topic, it's not good to mix
these words with different correlations.
So we can then further
separate these words.
We are going to get all the red words
that indicate positive correlations,
W1 and W3.
And
we're going to also get another subtopic,
if you want,
that represents the negatively
correlated words, W2 and W4.
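The word-level split just described can be sketched in a few lines. This is only an illustration under assumptions: the word counts and the external series are made-up toy numbers, plain Pearson correlation stands in for whatever causality measure is actually used, and the names `split_topic_words` and `threshold` are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def split_topic_words(word_series, external, threshold=0.5):
    """Split a topic's top words into positively and negatively correlated subtopics."""
    positive, negative = [], []
    for word, series in word_series.items():
        r = pearson(series, external)
        if r >= threshold:
            positive.append(word)
        elif r <= -threshold:
            negative.append(word)
    return positive, negative

# Toy data (made up): per-period counts of the four top words of Topic 1.
word_series = {
    "w1": [1, 2, 3, 4, 5],
    "w2": [5, 4, 3, 2, 1],
    "w3": [2, 3, 5, 6, 8],
    "w4": [9, 7, 6, 4, 2],
}
external = [10, 12, 13, 15, 18]  # e.g. a stock price per period

positive_words, negative_words = split_topic_words(word_series, external)
print(positive_words, negative_words)
```

Each of the two word lists would then be turned into a prior for the next round of topic modeling.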
Now, these subtopics, or these variations
of topics, based on the correlation
analysis, are topics that are still quite
related to the original topic, Topic 1.
But they are already deviating,
because of the use of time series
information to bias the selection of words.
So in some sense,
we should expect them to be
more correlated with the time
series than the original Topic 1.
Because Topic 1 has mixed words,
and here we separate them.
So each of these two subtopics
can be expected to be better
correlated with the time series.
However, they may not be so
coherent semantically.
So the idea here is to go back
to the topic model by using each of
these as a prior to further
guide the topic modeling.
And that's to say we ask our topic
model to now discover topics that
are very similar to each
of these two subtopics.
And this will cause a bias toward topics
that are more correlated with the time series.
Of course, then we can apply topic models
to get another generation of topics.
And those can be further ranked based on
the time series to select the highly
correlated topics.
And then we can further analyze
the component words in the topic and
then try to analyze word-level
correlation.
And then get even more
correlated subtopics that can be
further fed into the process as a prior
to drive the topic model discovery.
So this whole process is just a heuristic
way of optimizing causality and
coherence, and that's our ultimate goal.
Right?
So here you see the pure topic
models will be very good at
maximizing topic coherence,
the topics will be all meaningful.
If we only use a causality test,
or a correlation measure,
then we might get a set of words that
are strongly correlated with the time series,
but they may not
necessarily mean anything.
They might not be semantically connected.
So, that would be at the other extreme,
on the top.
Now, the ideal is to get the causal
topic that's scored high,
both in topic coherence and
also causal relation.
This approach
can be regarded as an alternating
way to maximize both dimensions.
So when we apply the topic models
we're maximizing the coherence.
But when we decompose the topic
model words into sets
of words that are very strongly
correlated with the time series,
and select the words most strongly
correlated with the time series,
we are pushing the model
back to the causal
dimension to make it
better in causal scoring.
And then, when we apply
the selected words as a prior
to guide topic modeling, we again
go back to optimize the coherence,
because topic models ensure the next
generation of topics to be coherent, and
we can iterate to optimize
in this way, as shown in this picture.
So I think the only component that you
haven't seen in such a framework is how
to measure the causality,
because the rest is just topic modeling.
So let's have a little bit
of discussion of that.
So here we show that.
And let's say we have a topic
about government response here.
And then with just topic modeling we can
get coverage of the topic over time.
So, we have a time series, X sub t.
Now, we are also given a time series
that represents external information.
It's a non-text time series, Y sub t.
It's the stock prices.
Now the question
here is: does Xt cause Yt?
Well, in other words, we want to measure
the causality relation between the two.
Or maybe just measure
the correlation of the two.
There are many measures that
we can use in this framework.
For example, Pearson correlation
is a commonly used measure.
And we have to consider a time lag here so
that we can try to
capture causal relation,
using the data in the past
to try to correlate with the data
points of Y that represent the future,
for example.
And by introducing such a lag, we can
hopefully capture some causal relation
even by using correlation measures
like Pearson correlation.
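Here is a small sketch of correlation with a time lag, assuming toy series in which Y simply repeats X two steps later. The helper name `lagged_correlation` and the data are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def lagged_correlation(x, y, lag):
    """Correlate x at time t with y at time t + lag, for lag >= 0."""
    if lag < 0 or lag >= len(x):
        raise ValueError("lag must be in [0, len(x))")
    # Drop the last `lag` points of x and the first `lag` points of y.
    return pearson(x[:len(x) - lag], y[lag:])

# Toy data (made up): y repeats x with a delay of two time steps.
x = [1, 3, 2, 5, 4, 6, 8, 7]
y = [0, 0, 1, 3, 2, 5, 4, 6]

# Pick the lag at which the past of x lines up best with the future of y.
best_lag = max(range(4), key=lambda lag: lagged_correlation(x, y, lag))
print(best_lag, lagged_correlation(x, y, best_lag))
```

A high correlation at a positive lag is only suggestive: it says the past of X lines up with the future of Y, not that X causes Y.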
But a commonly used measure for
causality here is the Granger causality test.
And the idea of this test
is actually quite simple.
Basically you're going to have
an autoregressive model to
use the history information
of Y to predict itself.
And this is the best we could do
without any other information.
So we're going to build such a model.
And then we're going to add some history
information of X into such model.
To see if we can improve
the prediction of Y.
If we can do that with a statistically
significant difference,
then we just say X has some
causal influence on Y,
because otherwise it wouldn't have caused
the improvement of prediction of Y.
If, on the other hand,
the difference is insignificant,
that would mean X does not really
have a causal relation with Y.
So that's the basic idea.
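To make the idea concrete, here is a rough sketch of the comparison behind the test, under stated assumptions. The data is synthetic, the function names `ols_rss` and `granger_f` are hypothetical, and the code only computes the F statistic that compares the two models. A real test would also turn that statistic into a p-value using the F distribution, which is left out here.

```python
import random

def ols_rss(rows, targets):
    """Residual sum of squares of a least-squares fit via the normal equations."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    # Gaussian elimination with partial pivoting on the k x k system.
    for col in range(k):
        pivot = max(range(col, k), key=lambda i: abs(a[i][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, k):
            factor = a[row][col] / a[col][col]
            for j in range(col, k):
                a[row][j] -= factor * a[col][j]
            b[row] -= factor * b[col]
    # Back substitution to recover the regression coefficients.
    coef = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(a[i][j] * coef[j] for j in range(i + 1, k))
        coef[i] = (b[i] - tail) / a[i][i]
    return sum((t - sum(c * v for c, v in zip(coef, r))) ** 2
               for r, t in zip(rows, targets))

def granger_f(x, y, lag=1):
    """F statistic: does adding past x improve an autoregressive model of y?"""
    times = range(lag, len(y))
    # Restricted model: intercept plus the past values of y only.
    own_past = [[1.0] + [y[t - i] for i in range(1, lag + 1)] for t in times]
    # Full model: the same columns plus the past values of x.
    with_x = [row + [x[t - i] for i in range(1, lag + 1)]
              for row, t in zip(own_past, times)]
    targets = [y[t] for t in times]
    rss_restricted = ols_rss(own_past, targets)
    rss_full = ols_rss(with_x, targets)
    dof = len(targets) - (2 * lag + 1)
    return ((rss_restricted - rss_full) / lag) / (rss_full / dof)

# Synthetic data (made up): y follows x with a one-step delay plus noise.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + rng.gauss(0, 0.1) for t in range(1, 200)]

f_x_to_y = granger_f(x, y)  # past x should help predict y
f_y_to_x = granger_f(y, x)  # past y should not help predict x
print(f_x_to_y, f_y_to_x)
```

The asymmetry between the two directions is what the test is after: the statistic is large only when the history of one series improves the prediction of the other.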
Now, we don't have time to explain
this in detail, but
you could read the cited reference
here to know more about this measure.
It's a very commonly used measure.
It has many applications.
So next, let's look at some simple
results generated by this approach.
And here the data is
the New York Times and
in the time period of June
2000 through December of 2011.
And here the time series we used
is stock prices of two companies.
American Airlines and Apple and
the goal is to see if we inject
some time series context,
whether we can actually get topics
that are biased toward the time series.
Imagine if we don't use any input,
we don't use any context.
Then the topics from the New York
Times discovered by PLSA would be
just general topics that
people talk about in news,
all right,
those major topics in the news events.
But here you see these topics are indeed
biased toward each time series.
And particularly if you look
at the underlined words here
in the American Airlines result,
and you see airlines,
airport, air, united trade,
or terrorism, etc.
So it clearly has topics that are more
correlated with the external time series.
On the right side,
you see that some of the topics
are clearly related to Apple, right.
So you can see computer, technology,
software, internet, com, web, etc.
So that just means the time series
has effectively served as a context
to bias the discovery of topics.
From another perspective,
these results help us understand what people
have talked about in each case.
So not just what
people have talked about,
but what are some topics that might be
correlated with their stock prices.
And so these topics can serve
as a starting point for
people to further look into issues and
find the true causal relations.
Here are some other results from analyzing
Presidential Election time series.
The time series data here is
from Iowa Electronic market.
And that's a prediction market.
And the data is the same.
New York Times from May
2000 to October 2000.
That's for
the 2000 presidential election campaign.
Now, what you see here
are the top three words in significant
topics from New York Times.
And if you look at these topics, and they
are indeed quite related to the campaign.
Actually the issues
are very much related to
the important issues of
this presidential election.
Now here I should mention that the text
data has been filtered by using
only the articles that mention
these candidate names.
It's a subset of these news articles.
Very different from
the previous experiment.
But the results here clearly show
that the approach can uncover some
important issues in that
presidential election.
So tax cut, oil energy, abortion and
gun control are all known
to be important issues in
that presidential election.
And that was supported by some
literature in political science.
And they are also discussed in Wikipedia,
right.
So basically the results show
that the approach can effectively
discover possibly causal topics
based on the time series data.
So there are two suggested readings here.
One is the paper about this iterative
topic modeling with time series feedback.
Where you can find more details
about how this approach works.
And the second one is a reading
about the Granger causality test.
So in the end, let's summarize
the discussion of Text-based Prediction.
Now, Text-based prediction
is generally very useful for
big data applications that involve text.
Because it can help us infer
new knowledge about the world.
And the knowledge can go beyond
what's discussed in the text.
As a result, it can also support
optimizing our decision making.
And this has widespread applications.
Text data is often combined with
non-text data for prediction,
because, for this purpose,
the prediction purpose,
we generally would like to combine
non-text data and text data together,
as many clues as possible for prediction.
And so as a result, joint
analysis of text and
non-text data is very necessary and
it's also very useful.
Now when we analyze text data
together with non-text data,
we can see they can help each other.
So non-text data provide a context for
mining text data, and
we discussed a number of techniques for
contextual text mining.
And on the other hand,
text data can also help interpret
patterns discovered from non-text data,
and this is called pattern annotation.
In general,
this is a very active research topic, and
there are new papers being published.
And there are also many open
challenges that have to be solved.
[MUSIC]

This lecture is a summary
of this whole course.
First, let's revisit the topics
that we covered in this course.
In the beginning, we talked about
the natural language processing and
how it can enrich text representation.
We then talked about how to mine
knowledge about the language,
the natural language used to express
what's observed in the world in text
data.
In particular, we talked about
how to mine word associations.
We then talked about how
to analyze topics in text.
How to discover topics and analyze them.
This can be regarded as
knowledge about observed world,
and then we talked about how to mine
knowledge about the observer, and
in particular talked about how to
mine opinions and do sentiment analysis.
And finally, we talked about
text-based prediction, which has to
do with predicting values of other real
world variables based on text data.
And in discussing this, we also
discussed the role of non-text data,
which can contribute additional
predictors for the prediction problem,
and also can provide context for
analyzing text data, and
in particular we talked about how
to use context to analyze topics.
So here are the key high-level
take-away messages from this course.
I'm going to go over these major topics and
point out what are the key take-away
messages that you should remember.
First the NLP and text representation.
You should realize that NLP
is always very important for
any text application because it
enriches text representation.
The more NLP the better text
representation we can have.
And this further enables more
accurate knowledge discovery,
to discover deeper knowledge,
buried in text.
However, the current state of the art
of natural language processing is
still not robust enough.
So, as a result,
the robust text mining technologies today
tend to be based on word [INAUDIBLE].
And tend to rely a lot
on statistical analysis,
as we've discussed in this course.
And you may recall we've mostly
used word based representations.
And we've relied a lot on
statistical techniques,
statistical learning
techniques particularly.
In word association mining and
analysis, the important points are: first,
we introduced two concepts for
two basic and
complementary relations of words,
paradigmatic and syntagmatic relations.
These are actually very general
relations between elements of sequences,
if you take them as meaning
elements that occur in similar
contexts in the sequence and elements
that tend to co-occur with each other.
And these relations might be also
meaningful for other sequences of data.
We also talked a lot about
text similarity when we
discussed how to discover
paradigmatic relations: we compare
the contexts of words to discover
words that share similar contexts.
At that point,
we talked about representing text
data with a vector space model.
And we talked about some retrieval
techniques such as BM25 for
measuring similarity of text and
for assigning weights to terms,
tf-idf weighting, et cetera.
And this part is well-connected
to text retrieval.
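As a reminder of how such term weighting works, here is a minimal sketch, assuming one common variant: a BM25-style TF transformation with document length normalization, multiplied by an IDF of log((M+1)/df). The documents are toy examples, and the function names and the parameter values k and b are assumptions for illustration.

```python
import math
from collections import Counter

def bm25_vectors(docs, k=1.2, b=0.75):
    """Return one {term: weight} dict per document (BM25-style TF times IDF)."""
    m = len(docs)
    avdl = sum(len(d) for d in docs) / m                 # average document length
    df = Counter(term for d in docs for term in set(d))  # document frequency
    vectors = []
    for d in docs:
        counts = Counter(d)
        norm = 1 - b + b * len(d) / avdl                 # length normalization
        vectors.append({
            t: (k + 1) * c / (c + k * norm) * math.log((m + 1) / df[t])
            for t, c in counts.items()
        })
    return vectors

def dot(u, v):
    """Dot-product similarity of two sparse term vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Toy collection (made up).
docs = [
    "the cat eats fish".split(),
    "the dog eats meat".split(),
    "the stock market fell".split(),
]
vecs = bm25_vectors(docs)

print(dot(vecs[0], vecs[1]), dot(vecs[0], vecs[2]))
print(vecs[0]["cat"], vecs[0]["the"])
```

The first two documents share the informative word "eats", so they come out more similar than the first and third, which share only the common word "the". The rare word "cat" also gets a higher weight than "the" because of IDF.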
There are other techniques that
can be relevant here also.
The next point is about
co-occurrence analysis of text, and
we introduce some information
theory concepts such as entropy,
conditional entropy,
and mutual information.
These are not only very useful for
measuring the co-occurrences of words,
they are also very useful for
analyzing other kind of data, and
they are useful for, for example, for
feature selection in text
categorization as well.
So this is another important concept,
good to know.
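As a quick refresher, here is a sketch of entropy and mutual information for word occurrence, assuming each text segment is treated as a set of words and X_w is 1 when word w occurs in the segment. The segments are made up, and no smoothing is applied.

```python
import math

def entropy(p):
    """Entropy in bits of a binary variable with P(X=1) = p."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def mutual_information(segments, w1, w2):
    """I(X_w1; X_w2) in bits, where X_w = 1 if word w occurs in a segment."""
    n = len(segments)
    mi = 0.0
    for a in (True, False):
        for b in (True, False):
            p_ab = sum(1 for s in segments if (w1 in s) == a and (w2 in s) == b) / n
            p_a = sum(1 for s in segments if (w1 in s) == a) / n
            p_b = sum(1 for s in segments if (w2 in s) == b) / n
            if p_ab > 0:
                mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi

# Toy segments (made up), each represented as a set of words.
segments = [
    {"cat", "eats", "fish"},
    {"dog", "eats", "meat"},
    {"the", "market", "fell"},
    {"the", "price", "rose"},
]

mi_eats_meat = mutual_information(segments, "eats", "meat")
mi_eats_eats = mutual_information(segments, "eats", "eats")
print(mi_eats_meat, mi_eats_eats, entropy(0.5))
```

The mutual information of a word with itself equals its entropy, which is the upper bound for any other word paired with it.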
And then we talked about
topic mining and analysis, and
that's where we introduced
the probabilistic topic model.
We spent a lot of time
explaining the basic topic model,
PLSA, in detail, and this is the
basis for understanding LDA, which is,
theoretically, a more appealing model, but
we did not have enough time to really
go in depth in introducing LDA.
But in practice,
PLSA seems as effective as LDA, and
it's simpler to implement and
it's also more efficient.
In this part we also introduced some
general concepts that would be useful to
know. One is the generative model,
and this is a general method for
modeling text data and
modeling other kinds of data as well.
And we talked about the maximum likelihood
estimator and the EM algorithm for
solving the problem of
computing the maximum likelihood estimator.
So, these are all general techniques
that tend to be very useful
in other scenarios as well.
Then we talked about text
clustering and text categorization.
Those are two important building blocks
in any text mining application system.
In text clustering we talked
about how we can solve the problem by
using a slightly different mixture model
than the probabilistic topic model,
and we then also briefly
reviewed the similarity-based
approaches to text clustering.
In categorization we also talked
about two kinds of approaches.
One is generative classifiers
that rely on Bayes' rule to
infer the conditional
probability of a category given text data,
and in particular we introduced
[INAUDIBLE] Bayes in detail.
This is a practically useful technique
for a lot of text categorization tasks.
We also introduced some
discriminative classifiers,
particularly logistic regression,
k-nearest neighbors, and SVM.
They are also very important, they are very
popular, and they are very useful for
text categorization as well.
In both parts, we also discussed
how to evaluate the results.
Evaluation is quite important because if
the measures that you use don't really
reflect the utility of the method, then
they would give you misleading results, so
it's very important to
get the evaluation right.
And we talked about evaluation of
categorization in detail with a lot of
specific measures.
Then we talked about sentiment
analysis and opinion mining, and
that's where we introduced the
sentiment classification problem.
And although it's a special
case of text categorization,
we talked about how to extend or
improve the text categorization method
by using more sophisticated features that
would be needed for sentiment analysis.
We did a review of some commonly used
complex features for text analysis, and
then we also talked about how to
capture the order of these categories
in sentiment classification, and
in particular we introduced ordinal
logistic regression. Then we also talked
about Latent Aspect Rating Analysis.
This is an unsupervised way of using
a generative model to understand
review data in more detail.
In particular, it allows us to
understand the decomposed ratings of
a reviewer on different
aspects of a topic.
So given text reviews
with overall ratings,
the method allows us to infer
ratings on different aspects.
And it also allows us to infer
the reviewer's latent
weights on these aspects, so
which aspects are more important to
a reviewer can be revealed as well.
And this enables a lot of
interesting applications.
Finally, in the discussion of prediction,
we mainly talk about the joint mining
of text and non text data, as they
are both very important for prediction.
We particularly talked about how text data
can help non-text data and vice versa.
In the case of using non-text
data to help text data analysis,
we talked about
the contextual text mining.
We introduced the contextual PLSA as a
generalized model of PLSA
that allows us to incorporate context
variables, such as time and location.
And this is a general way to allow us
to reveal a lot of interesting topic
patterns in text data.
We also introduced NetPLSA;
in this case we used a social network, or
a network in general, of text
data to help analyze topics.
And finally we talked about how
time series can be used as context to
mine potentially causal
topics in text data.
Now, in the other way, of using text to
help interpret patterns
discovered from non-text data,
we did not really discuss anything in
detail but just provided a reference, but
I should stress that that's actually a very
important direction to know about
if you want to build practical
text mining systems,
because understanding and
interpreting patterns is quite important.
So this is a summary of the key
take away messages, and
I hope these will be very
useful to you for building any
text mining applications or for
the study of these algorithms.
And this should provide a good basis for
you to read frontier research papers,
to know more about more advanced
algorithms, or
to invent new algorithms yourself.
So to know more about this topic,
I would suggest you to look
into other areas in more depth.
And during this short period
of time of this course,
we could only touch the basic concepts,
basic principles, of text mining and
we emphasize the coverage
of practical algorithms.
And this comes at a cost
in the coverage of algorithms:
in many cases we omitted the discussion
of a lot of algorithms.
So to learn more about the subject
you should definitely learn more
about natural language processing,
because this is the foundation for
all text-based applications.
The more NLP you can do, the better
the representation of text that you can get, and
then the deeper knowledge
you can discover.
So this is very important.
The second area you should look into
is the Statistical Machine Learning.
And these techniques are now
the backbone techniques for
not just text analysis applications but
also for NLP.
A lot of NLP techniques are nowadays
actually based on supervised machine learning.
So, they are very important
because they are a key
to also understanding some
advanced NLP techniques, and
naturally they will provide more tools for
doing text analysis in general.
Now, a particularly interesting area,
called deep learning has attracted
a lot of attention recently.
It has also shown promise
in many application areas,
especially in speech and vision, and
has been applied to text data as well.
So, for example, recently there has
been work on using deep learning to do
sentiment analysis to
achieve better accuracy.
So that's one example of [INAUDIBLE]
techniques that we weren't able to cover,
but that's also very important.
And the other area that has emerged
in statistical learning is the word
embedding technique, where they can
learn better representations of words.
And then these better representations will
allow you to compute similarity of words.
As you can see,
this provides directly a way to discover
the paradigmatic relations of words.
And results that people have got,
so far, are very impressive.
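To illustrate how a vector representation gives word similarity, here is a tiny sketch using cosine similarity. The three-dimensional vectors are made up by hand purely for illustration. Real embeddings are learned from large text collections and have many more dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors of the same length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made toy vectors (not learned from data).
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "stock": [0.0, 0.1, 0.9],
}

sim_cat_dog = cosine(embeddings["cat"], embeddings["dog"])
sim_cat_stock = cosine(embeddings["cat"], embeddings["stock"])
print(sim_cat_dog, sim_cat_stock)
```

Words with similar vectors, like "cat" and "dog" here, are the candidates for a paradigmatic relation.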
That's another promising technique
that we did not have time to touch,
but, of course,
whether these new techniques
would lead to practically useful techniques
that work much better than the current
technologies is still an open
question that has to be examined.
And no serious evaluation
has been done yet.
In, for example, examining
the practical value of word embedding,
other than word similarity and
basic evaluation.
But nevertheless,
these are advanced techniques
that surely will make impact
in text mining in the future.
So its very important to
know more about these.
Statistical learning is also the key to
predictive modeling which is very crucial
for many big data applications and we did
not talk about that predictive modeling
component but this is mostly about
the regression or categorization
techniques and this is another reason
why statistical learning is important.
We also suggest that you learn more about
data mining, and that's simply because
general data mining algorithms can always
be applied to text data, which can be
regarded as a special
case of general data.
So there are many applications
of data mining techniques.
In particular, for example, pattern
discovery would be very useful to generate
interesting features for text analysis,
and the recent information network
mining techniques can also be used
to analyze text information networks.
So these are all good to know.
In order to develop effective
text analysis techniques.
And finally, we also recommend you to
learn more about text retrieval,
information retrieval, or search engines.
This is especially important if you
are interested in building practical text
application systems.
And a search engine would
be an essential system
component in any text-based application.
And that's because text data
are created for humans to consume.
So humans are in the best position
to understand text data, and
it's important to have humans in the loop
in big text data applications. A search
engine can in particular help text
mining systems in two ways.
One is to effectively reduce
the data size from a large collection to
a small collection with the most
relevant text data that matters for
the particular application.
The other is to provide a way to
annotate and explain patterns,
and this has to do with
knowledge provenance.
Once we discover some knowledge,
we have to figure out whether or
not the discovery is really reliable.
So we need to go back to
the original text to verify that.
And that is why the search
engine is very important.
Moreover, some techniques
of information retrieval,
for example BM25 and the vector space model,
are also very useful for text data mining.
We only mentioned some of them,
but if you know more about
text retrieval you'll see that there
are many techniques that are useful for it.
Another technique that is useful
is indexing, which enables quick
response of search engine to a user's
query, and such techniques can be
very useful for building efficient
text mining systems as well.
So, finally, I want to remind
you of this big picture for
harnessing big text data that I showed
you at the beginning of the semester.
So in general, to deal with
a big text application system,
we need two kinds of techniques,
text retrieval and text mining.
And text retrieval, as I explained,
is to help convert big text data into
a small amount of most relevant data for
a particular problem, and can also help
provide knowledge provenance and
help interpret patterns later.
Text mining has to do with further
analyzing the relevant data to discover
the actionable knowledge that can be
directly useful for decision making or
many other tasks.
So this course covers text mining.
And there's a companion course
called Text Retrieval and
Search Engines that covers text retrieval.
If you haven't taken that course,
it would be useful for you to take it,
especially if you are interested
in building a text application system.
And taking both courses will give you
a complete set of practical skills for
building such a system.
So in [INAUDIBLE]
I just would like to thank you for
taking this course.
I hope you have learned useful knowledge
and skills in text mining and [INAUDIBLE].
As you see from our discussions,
there are a lot of opportunities for
this kind of techniques, and
there are also a lot of open challenges.
So I hope you can use what you have
learned to build a lot of useful
applications that will benefit society, and
to also join
the research community to discover new
techniques for text mining and benefits.
Thank you.
[MUSIC]

[NOISE]
This lecture is about
sentiment classification.
If we assume that
most of the elements in the opinion
representation are already known,
then our only task may be just sentiment
classification, as shown in this case.
So suppose we know who's the opinion
holder and what's the opinion target,
and also know the content and the context
of the opinion, then we mainly need to
decide the opinion
sentiment of the review.
So this is a case of just using sentiment
classification for understanding opinion.
Sentiment classification can be
defined more specifically as follows.
The input is an opinionated text object,
the output is typically a sentiment label,
or a sentiment tag, and
that can be designed in two ways.
One is polarity analysis, where we have
categories such as positive, negative,
or neutral.
The other is emotion
analysis that can go beyond
a polarity to characterize
the feeling of the opinion holder.
In the case of polarity analysis,
we sometimes
also have numerical ratings as you
often see in some reviews on the web.
Five might denote the most positive, and
one may be the most negative, for example.
In general, you have just discrete
categories to characterize the sentiment.
In emotion analysis, of course,
there are also different ways to
design the categories.
The six most frequently
used categories are happy,
sad, fearful, angry,
surprised, and disgusted.
So as you can see, the task is essentially
a classification task, or categorization
task, as we've seen before, so it's
a special case of text categorization.
This also means any text categorization
method can be used to do sentiment
classification.
Now of course if you just do that,
the accuracy may not be good,
because sentiment classification
does require some improvement over
regular text categorization techniques,
or simple text categorization techniques.
In particular,
it needs two kinds of improvements.
One is to use more sophisticated features
that may be more appropriate for
sentiment tagging as I
will discuss in a moment.
The other is to consider
the order of these categories, and
especially in polarity analysis,
it's very clear there's an order here,
and so these categories
are not all that independent.
There's order among them, and so
it's useful to consider the order.
For example, we could use
ordinal regression to do that,
and that's something that
we'll talk more about later.
So now, let's talk about some features
that are often very useful for
text categorization and
text mining in general, but
some of them are especially also
needed for sentiment analysis.
So let's start from the simplest one,
which is character n-grams.
You can just have a sequence
of characters as a unit,
and they can be mixed with different n's,
different lengths.
All right, and
this is a very general way and
very robust way to
represent the text data.
And you could do that for
any language, pretty much.
And this is also robust to spelling
errors or recognition errors, right?
So if you misspell a word by one character
and this representation actually would
allow you to match this word when
it occurs in the text correctly.
Right, so misspell the word and
the correct form can be matched because
they contain some common
n-grams of characters.
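Here is a minimal sketch of that robustness, assuming we mix character 2-grams and 3-grams. The example word and its misspelling are made up for illustration.

```python
def char_ngrams(text, n_values=(2, 3)):
    """Set of character n-grams of the given lengths."""
    return {text[i:i + n] for n in n_values for i in range(len(text) - n + 1)}

correct = char_ngrams("sentiment")
misspelled = char_ngrams("sentimant")   # one wrong character

shared = correct & misspelled
overlap = len(shared) / len(correct | misspelled)   # Jaccard overlap
print(sorted(shared), overlap)
```

As whole words the two strings do not match at all, but they still share many of their character n-grams, so a classifier or matcher working on these features can still connect them.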
But of course such a representation
would not be as discriminating as words.
So next, we have word n-grams,
a sequence of words, and again,
we can mix them with different n's.
Unigrams are actually often very
effective for a lot of text processing
tasks, and it's mostly because words
are well-designed features by humans for
communication, and so
they are often good enough for many tasks.
But it's not good, or not sufficient for
sentiment analysis clearly.
For example, we might see a sentence like,
it's not good or
it's not as good as something else, right?
So in such a case, if you
just take "good",
that would suggest positive, and that's not
good; all right, so it's not accurate.
But if you take a bigram, "not good"
together, then it's more accurate.
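A small sketch of the difference, assuming simple whitespace tokenization (the helper name `word_ngrams` is made up):

```python
def word_ngrams(tokens, n):
    """List of word n-grams, each joined into one string feature."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "it's not good".split()

unigrams = word_ngrams(tokens, 1)
bigrams = word_ngrams(tokens, 2)
print(unigrams)
print(bigrams)
```

The unigram features include "good" on its own, which looks positive, while the bigram features keep "not good" together as one unit.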
So longer n-grams are generally more
discriminative, and they're more specific.
If you match it, it says a lot, and
it's accurate; it's unlikely
to be very ambiguous.
But it may cause overfitting, because with
such very unique features a machine
learning program can easily pick up
such features from the training set and
rely on such unique features
to distinguish the categories.
And obviously, that kind of classifier
won't generalize well to future data, where
such discriminative features
will not necessarily occur.
So that's a problem of
overfitting that's not desirable.
We can also consider part-of-speech tag
n-grams if we can do part-of-speech
tagging, and, for example,
an adjective and a noun could form a pair.
We can also mix n-grams of words and
n-grams of part of speech tags.
For example, the word great might be
followed by a noun, and this could become
a feature, a hybrid feature, that could
be useful for sentiment analysis.
So next we can also have word classes.
So these classes can be syntactic, like
part-of-speech tags, or could be semantic,
and they might represent concepts in
a thesaurus or ontology, like WordNet.
Or they can be recognized named
entities, like people or places, and
these categories can be used to enrich
the representation as additional features.
We can also learn word clusters
empirically; for example,
we've talked about mining
associations of words.
And so we can have clusters of
paradigmatically related words or
syntagmatically related words, and
these clusters can be features to
supplement the word-based representation.
Furthermore, we can also have
frequent patterns in text, and
these could be frequent word sets, where
the words that
form the pattern do not necessarily
occur together or next to each other.
But we can also have patterns where
the words may occur more closely together,
and such
patterns provide more discriminative
features than words, obviously.
And they may also generalize better
than just regular n-grams because they
are frequent.
So you expect them to
occur also in test data.
So they have a lot of advantages, but
they might still face the problem
of overfitting as the features
become more complex.
This is a problem in general, and the same
is true for parse tree-based features,
when you can use a parse tree to derive
features such as frequent subtrees, or
paths, and
those are even more discriminating, but
they are also more likely
to cause overfitting.
And in general, pattern discovery
algorithms are very useful for
feature construction because they allow
us to search in a large space of possible
features that are more complex than
words that are sometimes useful.
So in general, natural language
processing is very important
to derive complex features, and
they can enrich text representation.
So for example,
this is a simple sentence that I showed
you a long time ago in another lecture.
So from these words we can only
derive simple word n-grams,
representations or character n-grams.
But with NLP,
we can enrich the representation
with a lot of other information such
as part of speech tags, parse trees or
entities, or even speech act.
Now with such enriching information
of course, then we can generate a lot
of other features, more complex features
like mixed n-grams of words and
part-of-speech tags, or
even a part of a parse tree.
So in general, feature design actually
affects categorization accuracy
significantly, and it's a very important
part of any machine learning application.
In general, I think it would be
most effective if you can combine
machine learning, error analysis, and
domain knowledge in designing features.
So first you want to
use domain knowledge,
your understanding of the problem,
to design seed features, and
you can also define a basic feature space
with a lot of possible features for
the machine learning program to work on,
and machine learning can be applied to select
the most effective features or
construct new features.
That's feature learning, and
these features can then be further
analyzed by humans through error analysis.
And you can look at
the categorization errors, and
then further analyze what features can
help you recover from those errors,
or what features cause overfitting and
cause those errors.
And so this can lead to
feature validation that will
revise the feature set,
and then you can iterate.
And we might consider using
a different feature space.
So NLP enriches text
representation as I just said, and
because it enriches the feature space,
it allows a much larger space
of features, and there are also many,
many more features that can be
very useful for a lot of tasks.
But be careful not to use a lot
of complex features because
they can cause overfitting,
or otherwise you would
have to train carefully
so as not to let overfitting happen.
So a main challenge in designing features,
a common challenge is to optimize
a trade off between exhaustivity and
the specificity, and this trade off
turns out to be very difficult.
Now exhaustivity means we want
the features to actually have
high coverage of a lot of documents.
And so in that sense,
you want the features to be frequent.
Specificity requires the feature
to be discriminative, and
naturally, infrequent features
tend to be more discriminative.
So this really causes a trade off between
frequent versus infrequent features.
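The trade off can be illustrated with a toy computation. The sketch below measures coverage as the fraction of documents containing a feature, and uses the share of the majority label among those documents as a rough stand-in for how discriminative the feature is. Both measures, the function name, and the tiny data set are assumptions made for illustration, not definitions from the lecture.

```python
from collections import Counter
from typing import List, Set, Tuple

def coverage_and_purity(docs: List[Set[str]], labels: List[str],
                        feature: str) -> Tuple[float, float]:
    """Return (coverage, purity) of a feature over labeled documents."""
    hits = [lab for doc, lab in zip(docs, labels) if feature in doc]
    if not hits:
        return 0.0, 0.0
    coverage = len(hits) / len(docs)
    # Share of the most common label among documents containing the feature.
    purity = Counter(hits).most_common(1)[0][1] / len(hits)
    return coverage, purity

# Toy labeled documents, each represented as a set of words.
docs = [
    {"the", "room", "was", "great"},
    {"the", "room", "was", "dirty"},
    {"the", "staff", "was", "great"},
    {"the", "food", "was", "awful"},
]
labels = ["pos", "neg", "pos", "neg"]

for feature in ["the", "great", "dirty"]:
    print(feature, coverage_and_purity(docs, labels, feature))
```

Here the frequent word "the" covers every document but tells us nothing about the label, while the rare word "dirty" is perfectly discriminative but covers only one document.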
And that's why feature
design is usually hard.
And that's probably the most important
part of any machine learning
problem, and particularly in our case,
for text categorization or,
more specifically,
sentiment classification.
[MUSIC]

[NOISE] This lecture is about the ordinal
logistic regression for
sentiment analysis.
So, this is our problem set up for a
typical sentiment classification problem.
Or more specifically a rating prediction.
We have an opinionated text document d as
input, and we want to generate as output,
a rating in the range of 1 through k so
it's a discrete rating, and
this is a categorization problem.
We have k categories here.
Now we could use a regular text
categorization technique
to solve this problem.
But such a solution would not consider the
order and dependency of the categories.
Intuitively, the features that can
distinguish category 2 from 1,
or rather rating 2 from 1,
may be similar to
those that can distinguish k from k-1.
For example, positive words
generally suggest a higher rating.
When we train a categorization
model by treating these categories as
independent, we would not capture this.
So what's the solution?
Well, in general we can use ordinal classification,
and there are many different approaches.
And here we're going to
talk about one of them that is
called ordinal logistic regression.
Now, let's first think about how
we use logistic regression for
a binary sentiment
categorization problem.
So suppose we just wanted to distinguish
a positive from a negative and
that is just a two category
categorization problem.
So the predictors are represented as X and
these are the features.
And there are M features all together.
The feature value is a real number.
And this can be representation
of a text document.
And Y has two values,
a binary response variable, 0 or 1.
1 means X is positive,
0 means X is negative.
And then of course this is a standard
two category categorization problem.
We can apply logistic regression.
You may recall that in logistic
regression, the log odds
that Y is equal to one, that is, the log of
the ratio of p(Y=1|X) to p(Y=0|X), is
assumed to be a linear function
of these features, as shown here.
So this would allow us to also write
the probability of Y equals one, given X,
in this equation that you
are seeing on the bottom.
So that's a logistic function, and
you can see it relates
this probability,
the probability that Y=1,
to the feature values.
And of course the beta i's
are parameters here, so this is
just a direct application of logistic
regression for binary categorization.
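As a small illustration of the logistic function described here, the following Python sketch computes p(Y=1|X) from a feature vector. The function name and the example numbers are assumptions chosen for illustration; beta0 is the intercept and beta holds one weight per feature.

```python
import math
from typing import Sequence

def prob_positive(x: Sequence[float], beta0: float, beta: Sequence[float]) -> float:
    """p(Y=1 | X) for binary logistic regression."""
    # Linear score: beta0 plus the weighted sum of the feature values.
    score = beta0 + sum(b * xi for b, xi in zip(beta, x))
    # Logistic function maps the score to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical two-feature document and weights.
p = prob_positive([2.0, 1.0], beta0=-1.0, beta=[0.5, 0.25])
print(p)
# The log odds recover the linear score (0.25 for these numbers).
print(math.log(p / (1.0 - p)))
```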
What if we have multiple categories,
multiple levels?
Well, we can use such a binary
logistic regression classifier
to solve this multi-level
rating prediction problem.
And the idea is we can introduce
multiple binary classifiers.
In each case we ask
the classifier to predict
whether the rating is j or above,
or the rating is lower than j.
So when Yj is equal to 1,
it means rating is j or above.
When it's 0,
that means the rating is lower than j.
So basically if we want to predict
a rating in the range of 1-k,
we first have one classifier to
distinguish k versus the others.
And that's our Classifier 1.
And then we're going to have another
classifier to distinguish
at least k-1 from the rest.
That's Classifier 2.
And in the end, we need a classifier
to distinguish between 2 and 1.
So altogether we'll have k-1 classifiers.
Now if we do that of course then
we can also solve this problem
and the logistic regression problem
will also be very straightforward
as you have just seen
on the previous slide.
Only that here we have more parameters.
Because for each classifier,
we need a different set of parameters.
So now the logistic regression
classifiers are indexed by j,
which corresponds to a rating level.
And I have also used alpha sub
j to replace beta 0.
And this is to
make the notation more consistent
with what we will show in
ordinal logistic regression.
So here we now have basically k minus one
regular logistic regression classifiers.
Each has its own set of parameters.
So now with this approach,
we can now predict ratings as follows.
After we have trained these k-1
logistic regression classifiers,
separately of course,
then we can take a new instance and
then invoke a classifier
sequentially to make the decision.
So first let's look at the classifier
that corresponds to the rating level k.
So this classifier will tell
us whether this object should
have a rating of k or above.
If the probability according to this
logistic regression classifier is
larger than 0.5,
we're going to say yes.
The rating is k.
Now, what if it's not as
large as 0.5?
Well, that means the rating's below K,
right?
So now,
we need to invoke the next classifier,
which tells us whether
it's above K minus one.
It's at least K minus one.
And if the probability is
larger than 0.5,
then we'll say, well, then it's k-1.
What if it says no?
Well, that means the rating
would be even below k-1.
And so we're going to just keep
invoking these classifiers.
And here we hit the end when we need
to decide whether it's two or one.
So this would help us solve the problem.
Right?
So we can have a classifier that would
actually give us a prediction of a rating
in the range of 1 through k.
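The sequential procedure just described can be sketched in a few lines of Python. This is only an illustration under assumptions: the classifiers are given as already-trained (alpha_j, beta_j) pairs with made-up values, the function names are hypothetical, and a probability of exactly 0.5 is treated as a yes.

```python
import math
from typing import Dict, Sequence, Tuple

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict_rating(x: Sequence[float],
                   classifiers: Dict[int, Tuple[float, Sequence[float]]],
                   k: int) -> int:
    """classifiers[j] = (alpha_j, beta_j) estimates p(rating >= j | x), j = 2..k."""
    # Invoke the classifiers from the highest rating level down.
    for j in range(k, 1, -1):
        alpha_j, beta_j = classifiers[j]
        p = sigmoid(alpha_j + sum(b * xi for b, xi in zip(beta_j, x)))
        if p >= 0.5:
            return j  # first classifier that says "j or above"
    return 1  # every classifier said "below j"

# Hypothetical trained classifiers for k = 3 with a single feature.
classifiers = {3: (-2.0, [1.0]), 2: (0.0, [1.0])}
print(predict_rating([3.0], classifiers, k=3))
print(predict_rating([1.0], classifiers, k=3))
print(predict_rating([-1.0], classifiers, k=3))
```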
Now unfortunately such a strategy is not
an optimal way of solving this problem.
And specifically there are two
problems with this approach.
So these equations are the same as
you have seen before.
Now the first problem is that there
are just too many parameters.
There are many parameters.
Now, can you count how many
parameters we have exactly here?
Now this may be an interesting exercise
to do.
So
you might want to just pause the video and
try to figure out the solution.
How many parameters do I have for
each classifier?
And how many classifiers do we have?
Well, you can see that
for each classifier we have
M plus one parameters, and we have k
minus one classifiers all together,
so the total number of parameters is
k minus one multiplied by M plus one.
That's a lot.
A lot of parameters, so when
the classifier has a lot of parameters,
we would in general need a lot of
training data
to help us decide the optimal
parameters of such a complex model.
So that's not ideal.
Now the second problem
is that these problems,
these k minus 1 classifiers,
are not really independent.
These problems are actually dependent.
In general, words that are positive
would make the rating higher
for any of these classifiers.
For all these classifiers.
So we should be able to take
advantage of this fact.
Now the idea of ordinal logistic
regression is precisely that.
The key idea is an
improvement over the k-1
independent logistic
regression classifiers.
And that idea is to tie
these beta parameters.
And that means we are going to
assume the beta parameters.
These are the parameters that indicate
the influence of those features.
And we're going to assume these
beta values are the same for
all the k-1 classifiers.
And this just encodes our intuition that
positive words in general would
make a higher rating more likely.
So this is an intuitively
reasonable assumption for our problem setup.
And we have this order
in these categories.
Now in fact, this would allow us
to have two positive benefits.
One is it's going to reduce
the number of parameters significantly.
And the other is to allow us
to share the training data.
Because all these parameters
are assumed to be equal.
So these training data, for
different classifiers can then be
shared to help us set
the optimal value for beta.
So we have more data to help
us choose a good beta value.
So what's the consequence,
well the formula would look very similar
to what you have seen before only that,
now the beta parameter has just one
index that corresponds to the feature.
It no longer has the other index that
corresponds to the level of rating.
So that means we tie them together.
And there's only one set of beta
values for all the classifiers.
However, each classifier still
has a distinct alpha value,
the alpha parameter,
which is different for each classifier.
And this is of course needed to predict
the different levels of ratings.
So alpha sub j is different; it
depends on j, and a different j
has a different alpha value.
But the rest of the parameters,
the beta i's are the same.
So now you can also ask the question,
how many parameters do we have now?
Again, that's an interesting
question to think about.
So if you think about it for a moment,
you will see that now
we have far fewer parameters.
Specifically we have M plus k minus one,
because we have M beta values
plus k minus one alpha values.
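To see the size of the reduction, here is a short Python sketch that counts parameters for both setups. The vocabulary size and number of rating levels are hypothetical numbers picked only for illustration.

```python
def independent_param_count(m: int, k: int) -> int:
    """k-1 separate classifiers, each with M feature weights plus one intercept."""
    return (k - 1) * (m + 1)

def ordinal_param_count(m: int, k: int) -> int:
    """One shared set of M beta values plus k-1 alpha values."""
    return m + (k - 1)

# Hypothetical setting: 10,000 word features and ratings from 1 to 5.
m, k = 10_000, 5
print(independent_param_count(m, k))  # 40004
print(ordinal_param_count(m, k))      # 10004
```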
So that's basically
the main idea of
ordinal logistic regression.
So, now, let's see how we can use such
a method to actually assign ratings.
It turns out that with this idea of
tying all the parameters, the beta values,
we also end up having
a similar way to make decisions.
And more specifically now, the criterion of
whether the predicted probability
is at least 0.5
is equivalent to
whether the score of
the object is larger than or
equal to negative alpha sub j,
as shown here.
Now, the scoring function is just
taking the linear combination of
all the features with
the tied beta values.
So, this means now we can simply make
a decision of rating, by looking at
the value of this scoring function,
and see which bracket it falls into.
Now you can see the general
decision rule is thus,
when the score is in the particular
range of alpha values,
then we will assign the corresponding
rating to that text object.
So in this approach,
we're going to score the object
by using the features and
trained parameter values.
This score will then be
compared with a set of trained
alpha values to see which
range the score is in.
And then,
using the range, we can then decide which
rating the object should be getting.
Because, these ranges of alpha
values correspond to the different
levels of ratings, and that's from
the way we train these alpha values.
Each is tied to some level of rating.
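This decision rule can be sketched as follows. With shared beta values, the probability of "rating j or above" is at least 0.5 exactly when alpha_j plus the score is at least zero, that is, when the score is at least negative alpha_j. The Python below assumes trained values are already available; the function name and all numbers are made up for illustration.

```python
from typing import Dict, Sequence

def ordinal_predict(x: Sequence[float], beta: Sequence[float],
                    alphas: Dict[int, float]) -> int:
    """alphas[j] is the alpha value for rating level j, with j = 2..k."""
    # One shared scoring function for all rating levels.
    score = sum(b * xi for b, xi in zip(beta, x))
    # Check the brackets from the highest rating level down.
    for j in sorted(alphas, reverse=True):
        if score >= -alphas[j]:
            return j
    return 1

# Hypothetical trained values for k = 4: thresholds -alpha_j are 0, 2 and 4.
beta = [1.0]
alphas = {2: 0.0, 3: -2.0, 4: -4.0}
for value in [5.0, 3.0, 1.0, -1.0]:
    print(value, ordinal_predict([value], beta, alphas))
```

Compared with k-1 independent classifiers, only one score is computed per object, and the alpha values simply cut the score axis into rating brackets.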
[MUSIC]

[MUSIC]
This lecture is about the Latent Aspect
Rating Analysis for Opinion Mining and
Sentiment Analysis.
In this lecture,
we're going to continue discussing
Opinion Mining and Sentiment Analysis.
In particular, we're going to introduce
Latent Aspect Rating Analysis
which allows us to perform detailed
analysis of reviews with overall ratings.
So, first is motivation.
Here are two reviews that you often
see on the net about a hotel.
And you see some overall ratings.
In this case,
both reviewers have given five stars.
And, of course,
there are also reviews that are in text.
Now, if you just look at these reviews,
it's not very clear whether the hotel is
good for its location or for its service.
It's also unclear why
a reviewer liked this hotel.
What we want to do is to
decompose this overall rating into
ratings on different aspects such as
value, rooms, location, and service.
So, if we can decompose
the overall ratings
into ratings on these different aspects,
then we
can obtain a more detailed understanding
of the reviewer's opinions about the hotel.
And this would also allow us to rank
hotels along different dimensions
such as value or rooms.
But, in general, such detailed
understanding will reveal more information
about the user's preferences,
reviewer's preferences.
And also, we can understand better
how the reviewers view this
hotel from different perspectives.
Now, not only do we want to
infer these aspect ratings,
we also want to infer the aspect weights.
So, some reviewers may care more about
value as opposed to the service.
And that would be a case
like what's shown on the left for
the weight distribution,
where you can see a lot of
weight is placed on value.
But others care more for service.
And therefore, they might place
more weight on service than value.
The reason why this is
also important is because,
if you think about a five star on value,
it might still be very expensive if the
reviewer cares a lot about service, right?
For this kind of service,
this price is good, so
the reviewer might give it a five star.
But if a reviewer really cares
about the value of the hotel,
then the five star, most likely,
would mean really cheap prices.
So, in order to interpret the ratings
on different aspects accurately,
we also need to know these aspect weights.
When they're combined together,
we can have a more detailed
understanding of the opinion.
So the task here is to get these reviews
and their overall ratings as input,
and then,
generate both the aspect ratings,
the decomposed aspect ratings, and
the aspect weights as output.
And this is a problem called
Latent Aspect Rating Analysis.
So the task, in general,
is given a set of review articles about
the topic with overall ratings, and
we hope to generate three things.
One is the major aspects
commented on in the reviews.
Second is ratings on each aspect,
such as value and room service.
And third is the relative weights placed
on different aspects by the reviewers.
And this task has a lot of applications,
and if you can do this,
it will enable a lot of applications.
I just listed some here.
And later, I will show you some results.
And, for example,
we can do opinion based entity ranking.
We can generate an aspect-level
opinion summary.
We can also analyze reviewers' preferences,
compare them or
compare their preferences
on different hotels.
And we can do personalized
recommendations of products.
So, of course, the question is
how can we solve this problem?
Now, as in other cases of
these advanced topics,
we won’t have time to really
cover the technique in detail.
But I’m going to give you a brief,
basic introduction to the techniques
developed for this problem.
So, first step, we’re going to talk about
how to solve the problem in two stages.
Later, we’re going to also mention that
we can do this in a unified model.
Now, take this review with
the overall rating as input.
What we want to do is, first,
we're going to segment the aspects.
So we're going to pick out what words
are talking about location, and
what words are talking
about room condition, etc.
So with this, we would be able
to obtain aspect segments.
In particular, we're going to
obtain the counts of all the words
in each segment, and
this is denoted by C sub I of W and D.
Now this can be done by using seed
words like location and room or
price to retrieve
the [INAUDIBLE] in the segments.
And then, from those segments,
we can further mine correlated
words with these seed words and
that would allow us to segment
the text into segments
discussing different aspects.
But, of course,
later, as we will see, we can also use
topic models to do the segmentation.
But anyway, that's the first stage,
where we obtain the counts
of words in each segment.
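A very rough Python sketch of this first stage follows. It assumes sentence-level segments, naive splitting on periods, and assignment of each sentence to the aspect with the most seed-word matches; it does not include the step of mining correlated words to expand the seeds. The function name, seed lists, and review text are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Set

def segment_counts(review: str, seeds: Dict[str, Set[str]]) -> Dict[str, Counter]:
    """Return word counts per aspect, i.e. c_i(w, d) for one review d."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in review.lower().split("."):
        tokens = sentence.split()
        if not tokens:
            continue
        # Number of seed-word matches for each aspect in this sentence.
        overlap = {a: sum(t in s for t in tokens) for a, s in seeds.items()}
        best = max(overlap, key=overlap.get)
        if overlap[best] == 0:
            continue  # no seed word found: leave the sentence unassigned
        counts[best].update(tokens)
    return dict(counts)

seeds = {"location": {"location"}, "room": {"room"}}
review = ("The location is amazing. "
          "The room was small and the room was noisy. We had fun.")
counts = segment_counts(review, seeds)
print(counts["location"]["amazing"])  # count of "amazing" in the location segment
print(counts["room"]["room"])         # count of "room" in the room segment
print(counts["location"]["far"])      # zero: the word does not occur there
```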
In the second stage,
which is called Latent Rating Regression,
we're going to use these words and
their frequencies in different
aspects to predict the overall rating.
And this prediction happens in two stages.
In the first stage,
we're going to use the [INAUDIBLE] and
the weights of these words in each
aspect to predict the aspect rating.
So, for example, if in your discussion
of location, you see a word like,
amazing, mentioned many times,
and it has a high weight.
For example, here, 3.9.
Then, it will increase
the Aspect Rating for location.
But, another word like, far,
which has a negative weight,
if it's mentioned many times,
it will decrease the rating.
So the aspect rating is assumed
to be a weighted combination of these
word frequencies where the weights
are the sentiment weights of the words.
Of course, these sentiment weights
might be different for different aspects.
So we have, for each aspect, a set of
term sentiment weights as shown here.
And that's denoted by beta sub i of w.
In the second stage or second step,
we're going to assume that the overall
rating is simply a weighted
combination of these aspect ratings.
So we're going to assume we have aspect
weights denoted by alpha sub i of d,
and this will be used to take a weighted
average of the aspect ratings,
which are denoted by r sub i of d.
And we're going to assume the overall
rating is simply a weighted
average of these aspect ratings.
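The two steps can be written out directly. In this Python sketch the weight 3.9 for "amazing" comes from the example above; every other number, the aspect names, and the function names are assumptions for illustration. In the real model the betas and alphas are latent and must be inferred, so here we only compute the forward prediction for given values.

```python
from typing import Dict

def aspect_rating(counts: Dict[str, int], beta: Dict[str, float]) -> float:
    """r_i(d): sum over words of c_i(w, d) times beta_i(w)."""
    return sum(c * beta.get(w, 0.0) for w, c in counts.items())

def overall_rating(aspect_ratings: Dict[str, float],
                   alpha: Dict[str, float]) -> float:
    """Weighted average of aspect ratings using aspect weights alpha_i(d)."""
    return sum(alpha[a] * r for a, r in aspect_ratings.items())

# Word counts per aspect segment for one hypothetical review.
counts = {"location": {"amazing": 2, "far": 1}, "room": {"clean": 1}}
# Aspect-specific sentiment weights (only 3.9 is taken from the lecture).
beta = {"location": {"amazing": 3.9, "far": -2.0}, "room": {"clean": 2.0}}
# Aspect weights for this review, assumed to sum to one.
alpha = {"location": 0.5, "room": 0.5}

ratings = {a: aspect_rating(counts[a], beta[a]) for a in counts}
print(ratings)                         # location is about 5.8, room is 2.0
print(overall_rating(ratings, alpha))  # about 3.9
```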
So this set up allows us to predict
the overall rating based on
the observable frequencies.
So on the left side,
you will see all these observed
information, the r sub d and the counts.
But on the right side,
you see all the information in
that range is actually latent.
So, we hope to discover that.
Now, this is a typical case of
a generative model where we embed
the interesting variables
in the generative model.
And then, we're going to set up
a generation probability for
the overall rating given
the observed words.
And then, of course, we can adjust these
parameter values, including the betas and
the alphas, in order to maximize
the probability of the data.
In this case, the conditional probability
of the observed rating given the document.
So we have seen such cases before in, for
example, PLSA,
where we predict text data.
But here, we're predicting the rating,
and the parameters,
of course, are very different.
But we can see, if we can uncover
these parameters, it would be nice,
because r sub i of d is precisely
the ratings that we want to get.
And these are the decomposed
ratings on different aspects.
Alpha sub i of d is precisely
the aspect weights that we
hope to get. As a byproduct,
we also get the beta vector, and
these are the [INAUDIBLE] factor,
the sentiment weights of words.
So more formally,
the data we are modeling here is a set of
review documents with overall ratings.
And each review document is denoted by d,
and the overall rating is denoted by r sub d.
And d is pre-segmented
into k aspect segments.
And we're going to use ci(w,d) to denote
the count of word w in aspect segment i.
Of course, it's zero if the word
doesn't occur in the segment.
Now, the model is going to
predict the rating based on d.
So, we're interested in the conditional
probability of r sub d given d.
And this model is set up as follows.
So r sub d is assumed to
follow a normal distribution
with a mean that is
actually a weighted average
of the aspect ratings r
sub i of d, as shown here.
This normal distribution has
a variance of delta squared.
Now, of course,
this is just our assumption.
The actual rating is not necessarily
generated this way.
But as always, when we make this
assumption, we have a formal way to
model the problem and that allows us
to compute the interesting quantities.
In this case, the aspect ratings and
the aspect weights.
Now, the aspect rating, as
you see on the [INAUDIBLE],
is assumed to be
a weighted sum of these sentiment weights,
where the weight is just
the [INAUDIBLE] of the word.
So as I said,
the overall rating is assumed to be
a weighted average of aspect ratings.
Now, these alpha values, alpha
sub i of d, denoted together
by an alpha vector that depends on d,
are the document-specific weights.
And we’re going to assume that
this vector itself is drawn
from another Multivariate Gaussian
distribution,
with mean denoted by a mu vector,
and covariance matrix sigma here.
Now, so this means, when we generate our
overall rating, we're going to first draw
a set of alpha values from this
Multivariate Gaussian prior distribution.
And once we get these alpha values,
we're going to then use the weighted
average of aspect ratings as
the mean here to use the normal
distribution to generate
the overall rating.
Now, the aspect rating, as I just said,
is the sum of the sentiment weights of
words in the aspect. Note that here the
sentiment weights are specific to the aspect.
So, beta is indexed by i,
and that's for the aspect.
And that gives us a way to model
different sentiment of a word.
This is needed because
the same word might have
positive sentiment for one aspect but negative for another.
It's also useful to see what parameters
we have here: beta sub i of
w gives us the aspect-specific
sentiment of w.
So, obviously,
that's one of the important parameters.
But, in general, we can see we have these
parameters, beta values, the delta,
and the Mu, and sigma.
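Putting the pieces together, the model described in words above can be written compactly as follows. This is a reconstruction from the spoken description, with V denoting the vocabulary as an added symbol, so the notation may differ in detail from the slide.

```latex
\begin{align*}
r_i(d) &= \sum_{w \in V} c_i(w, d)\, \beta_{i,w}
  && \text{(aspect rating)} \\
\boldsymbol{\alpha}(d) &\sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)
  && \text{(aspect weights of review } d\text{)} \\
r_d \mid d &\sim \mathcal{N}\!\Big(\sum_{i=1}^{k} \alpha_i(d)\, r_i(d),\; \delta^2\Big)
  && \text{(overall rating)}
\end{align*}
```

The parameters to estimate are the beta values, mu, sigma, and delta squared, while the alpha values for each review are latent variables.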
So, next, the question is, how can
we estimate these parameters? And
we collectively denote all
the parameters by lambda here.
Now, we can, as usual,
use the maximum likelihood estimate, and
this will give us the settings
of these parameters,
that would maximize the probability of the observed ratings
conditioned on their respective reviews.
And, of course,
this would then give us all the useful
variables that we
are interested in computing.
So, more specifically, we can now,
once we estimate the parameters,
we can easily compute the aspect rating,
for aspect i, r sub i of d.
And that's simply to take all of the words
that occurred in segment i,
and then take their counts and
then multiply that by the sentiment
weight of each word and take a sum.
So, of course, this term would be zero for
words that do not occur in the segment, and
that's why we can take the sum
over all the words in the vocabulary.
Now what about the aspect weights?
Alpha sub i of d, well,
it's not part of our parameters.
Right?
So we have to use that to compute it.
And in this case, we can use the Maximum
a Posteriori estimate to compute this alpha value.
Basically, we're going to maximize the
product of the prior of alpha according
to our assumed Multivariate Gaussian
Distribution and the likelihood.
In this case,
the likelihood is the probability of
generating this observed overall rating
given this particular alpha value and
some other parameters, as you see here.
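To illustrate the idea of maximizing prior times likelihood, here is a deliberately simplified Python sketch for two aspects. It makes several assumptions that are not from the lecture: the prior covariance is diagonal, the two weights are constrained to sum to one, the maximization is a plain grid search, and all numbers are made up. The actual inference procedure is in the cited paper.

```python
import math
from typing import Sequence, Tuple

def log_gaussian(x: float, mean: float, var: float) -> float:
    """Log density of a one-dimensional normal distribution."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def log_posterior(alpha: Sequence[float], r_obs: float,
                  aspect_ratings: Sequence[float], mu: Sequence[float],
                  sigma2: Sequence[float], delta2: float) -> float:
    # Prior on alpha: independent Gaussians (diagonal covariance assumption).
    log_prior = sum(log_gaussian(a, m, s) for a, m, s in zip(alpha, mu, sigma2))
    # Likelihood: observed overall rating around the weighted average.
    predicted = sum(a * r for a, r in zip(alpha, aspect_ratings))
    return log_prior + log_gaussian(r_obs, predicted, delta2)

def map_alpha_two_aspects(r_obs: float, aspect_ratings: Sequence[float],
                          mu: Sequence[float], sigma2: Sequence[float],
                          delta2: float, steps: int = 100) -> Tuple[float, float]:
    # Grid over (a, 1 - a); the sum-to-one constraint is an assumption.
    candidates = [(i / steps, 1 - i / steps) for i in range(steps + 1)]
    return max(candidates, key=lambda a: log_posterior(
        a, r_obs, aspect_ratings, mu, sigma2, delta2))

# Hypothetical review: aspect ratings 5.0 and 1.0, observed overall rating 4.0.
best = map_alpha_two_aspects(4.0, (5.0, 1.0), mu=(0.5, 0.5),
                             sigma2=(1.0, 1.0), delta2=0.01)
print(best)
```

Since the overall rating of 4.0 is much closer to the first aspect rating, the inferred weight on the first aspect ends up well above the prior mean of 0.5.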
So for more details about this model,
you can read this paper cited here.
[MUSIC]

[SOUND] This lecture is a continued
discussion of
Latent Aspect Rating Analysis.
Earlier, we talked about how to solve
the problem of LARA in two stages.
That is, we first do segmentation
of different aspects.
And then we use a latent regression
model to learn the aspect ratings and
then the aspect weights.
Now it's also possible to develop
a unified generative model for
solving this problem, and
that is, we not only model the generation
of the overall rating based on text,
we also model the generation of text,
and so
a natural solution would
be to use a topic model.
So given the entity,
we can assume there are aspects that
are described by word distributions,
that is, topics.
And then we can use a topic model to model
the generation of the review text.
We will assume words in the review text
are drawn from these distributions,
in the same way as we assumed for
a generative model like PLSA.
And then we can plug in
the latent regression model to
use the text to further
predict the overall rating.
And that means we first
predict the aspect ratings and
then combine them with aspect weights
to predict the overall rating.
So this would give us
a unified generative model,
where we model both the generation of text
and the overall rating conditioned on text.
So we don't have time to discuss
this model in detail as in
many other cases in this part of the course
where we discuss cutting-edge topics,
but there's a reference cited here
where you can find more details.
So now I'm going to show you some
simple results that you can get
by using this kind of generative model.
First, it's about rating decomposition.
So here, what you see
are the decomposed ratings for
three hotels that have
the same overall rating.
So if you just look at the overall rating,
you can't really tell much
difference between these hotels.
But by decomposing these
ratings into aspect ratings
we can see some hotels have higher
ratings for some dimensions,
like value, but others might score better
in other dimensions, like location.
And so this can give you detailed
opinions at the aspect level.
Now here, the ground-truth is
shown in parentheses, so
it also allows you to see whether
the prediction is accurate.
It's not always accurate, but it's mostly
still reflecting some of the trends.
The second result compares
different reviewers on the same hotel.
So the table shows the decomposed ratings
for two reviewers about the same hotel.
Again their high level
overall ratings are the same.
So if you just look at the overall
ratings, you don't really get that much
information about the difference
between the two reviewers.
But after you decompose the ratings,
you can see clearly that they have
high scores on different dimensions.
So this shows that the model can reveal
differences in opinions of different
reviewers and such a detailed
understanding can help us understand
better about reviewers and also better
about their feedback on the hotel.
This is something very interesting,
because this is in some
sense a byproduct.
In our problem formulation,
we did not really have to do this.
But the design of the generative
model has this component.
And these are sentiment weights for
words in different aspects.
And you can see the highly weighted words
versus the negatively weighted
words here for
each of the four dimensions.
Value, rooms, location, and cleanliness.
The top words clearly make sense, and
the bottom words also make sense.
So this shows that with this approach,
we can also learn sentiment
information directly from the data.
Now, this kind of lexicon is very useful
because in general, a word like long,
let's say, may have different sentiment
polarities for different contexts.
So if I say the battery life of this
laptop is long, then that's positive.
But if I say the rebooting time for
the laptop is long, that's bad, right?
So even for
reviews about the same product, laptop,
the word long is ambiguous, it could
mean positive or it could mean negative.
But this kind of lexicon, that we can
learn by using this kind of generative
model, can show whether a word is
positive for a particular aspect.
So this is clearly very useful, and in
fact such a lexicon can be directly used
to tag other reviews about hotels or
tag comments about hotels in
social media like Tweets.
And what's also interesting is that since
this is almost completely unsupervised,
assuming only that reviews with
overall ratings are available,
this can allow us to learn from
a potentially large amount of data on
the internet to enrich a sentiment lexicon.
And here are some results to
validate the preference weights.
Remember the model can infer whether
a reviewer cares more about service or
the price.
Now how do we know whether
the inferred weights are correct?
And this poses a very difficult
challenge for evaluation.
Now here we show some
interesting way of evaluating.
What you see here are the prices
of hotels in different cities, and
these are the prices of hotels that are
favored by different groups of reviewers.
The top ten are the reviewers
with the highest
inferred value-to-other-aspect ratios.
So for example value versus location,
value versus room, etcetera.
Now the top ten are the reviewers that
have the highest ratios by this measure.
And that means these reviewers
tend to put a lot of
weight on value as compared
with other dimensions.
So that means they really
emphasize value.
The bottom ten, on the other
hand, are the reviewers with
the lowest ratios. What does that mean?
Well it means these reviewers have
put higher weights on other aspects
than value.
So those are people that cared about
another dimension and they didn't care so
much about value in some sense, at least
as compared with the top ten group.
Now these ratios are computed based on
the inferred weights from the model.
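One way to compute such a ranking is sketched below in Python. The exact ratio used in the study is not spelled out here, so dividing the value weight by the mean of the other aspect weights is an assumption, as are the function names and the toy reviewer weights.

```python
from typing import Dict, List, Tuple

def value_ratio(weights: Dict[str, float]) -> float:
    """Inferred weight on value divided by the mean weight on other aspects."""
    others = [w for a, w in weights.items() if a != "value"]
    return weights["value"] / (sum(others) / len(others))

def top_and_bottom(reviewers: Dict[str, Dict[str, float]],
                   n: int) -> Tuple[List[str], List[str]]:
    """Reviewers with the n highest and the n lowest value ratios."""
    ranked = sorted(reviewers, key=lambda r: value_ratio(reviewers[r]),
                    reverse=True)
    return ranked[:n], ranked[-n:]

# Hypothetical inferred aspect weights for three reviewers.
reviewers = {
    "r1": {"value": 0.60, "location": 0.20, "service": 0.20},
    "r2": {"value": 0.20, "location": 0.40, "service": 0.40},
    "r3": {"value": 0.34, "location": 0.33, "service": 0.33},
}
print(top_and_bottom(reviewers, n=1))
```

The average price of the hotels favored by each group can then be compared, as in the table discussed here.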
So now you can see the average prices
of hotels favored by top ten reviewers
are indeed much cheaper than those
that are favored by the bottom ten.
And this provides some indirect way
of validating the inferred weights.
It just means the weights are not random.
They are actually meaningful here.
In comparison with
the average price in these three cities,
you can actually see the top ten
tend to have below-average prices,
whereas the bottom ten, who care
a lot about other things like service or
room condition, tend to have hotels
that have higher prices than average.
So with these results we can build
a lot of interesting applications.
For example, a direct application would be
to generate a rated aspect summary,
and because of the decomposition we
can now generate the summaries for
each aspect:
the positive sentences and the negative
sentences about each aspect.
It's more informative than the original review
that just has an overall rating and
review text.
Here are some other results
about the aspects discovered
from reviews with no ratings.
These are mp3 reviews,
and these results show that the model
can discover some interesting aspects
commented on in reviews with low overall ratings versus
those with higher overall ratings.
And they care more about
the different aspects.
Or they comment more on
the different aspects.
So that can help us discover for
example, consumers'
trend in appreciating different
features of products.
For example, one might have discovered
the trend that people tend to
like larger screens on cell phones or
lightweight laptops, etcetera.
Such knowledge can be useful for
manufacturers to design their
next generation of products.
Here are some interesting results
on analyzing users' rating behavior.
So what you see is average weights
along different dimensions by
different groups of reviewers.
And on the left side you see the weights
of reviewers that like the expensive hotels.
They gave the expensive hotels 5 stars,
and
you can see their average weights
tend to be more on service.
And that suggests that people like
expensive hotels because of good service,
and that's not surprising.
That's also another way to
validate the inferred weights.
If you look at the right side,
look at the column of 5 stars.
These are the reviewers that
like the cheaper hotels, and
they gave cheaper hotels five stars.
As we expected,
they put more weight on value,
and that's why they like
the cheaper hotels.
But if you look at when they didn't
like expensive hotels, or cheaper hotels,
then you'll see that they tended to
have more weight on the condition of
the room and cleanliness.
So this shows that by using this model,
we can infer some
information that's very hard to obtain
even if you read all the reviews.
Even if you read all the reviews it's
very hard to infer such preferences or
such emphasis.
So this is a case where text mining
algorithms can go beyond what
humans can do, to reveal
interesting patterns in the data.
And this of course can be very useful.
You can compare different hotels,
compare the opinions from different
consumer groups, in different locations.
And of course, the model is general.
It can be applied to any
reviews with overall ratings.
So this is a very useful
technique that can
support a lot of text mining applications.
Finally, here are the results of applying this
model to personalized ranking or
recommendation of entities.
So because we can infer the reviewers'
weights on different dimensions,
we can allow a user to actually
say what they care about.
So for example, I have a query
here that shows 90% of the weight
should be on value and 10% on others.
So that just means I don't
care much about other aspects.
I just care about getting a cheaper hotel.
My emphasis is on the value dimension.
Now what we can do with such a query
is we can use reviewers that we
believe have a similar preference
to recommend hotels for you.
How can we know that?
Well, we can infer the weights of
those reviewers on different aspects.
We can find the reviewers whose
weights, or more precisely,
whose inferred weights,
are similar to yours.
And then use those reviewers to
recommend hotels for you and
this is what we call personalized or
rather query specific recommendations.
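To make this idea concrete, here is a minimal sketch in Python,
under some assumptions of my own: the query and each reviewer's
inferred weights are vectors over the same aspects (value, room,
service), similarity is measured by cosine similarity, and hotels
are ranked by the average rating given by the most similar reviewers.
The hotel names, the numbers, and the functions cosine and recommend
are all hypothetical; the lecture does not say which similarity
measure or aggregation was actually used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two aspect-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(query_weights, reviewers, top_k_reviewers=2):
    """Rank hotels by the average rating of the reviewers whose
    inferred aspect weights are most similar to the query."""
    ranked = sorted(reviewers,
                    key=lambda r: cosine(query_weights, r["weights"]),
                    reverse=True)
    totals, counts = {}, {}
    for reviewer in ranked[:top_k_reviewers]:
        for hotel, rating in reviewer["ratings"].items():
            totals[hotel] = totals.get(hotel, 0.0) + rating
            counts[hotel] = counts.get(hotel, 0) + 1
    averages = {h: totals[h] / counts[h] for h in totals}
    return sorted(averages, key=averages.get, reverse=True)

# Hypothetical data; aspect order is (value, room, service).
reviewers = [
    {"weights": [0.8, 0.1, 0.1], "ratings": {"Budget Inn": 5, "Grand Palace": 2}},
    {"weights": [0.7, 0.2, 0.1], "ratings": {"Budget Inn": 4, "City Hostel": 5}},
    {"weights": [0.1, 0.2, 0.7], "ratings": {"Grand Palace": 5, "Budget Inn": 2}},
]
query = [0.9, 0.05, 0.05]  # 90% of the weight on value
ranking = recommend(query, reviewers)
print(ranking)
```

With this toy data, the two value-oriented reviewers drive the
ranking, so the hotel that only the service-oriented reviewer
liked ends up at the bottom.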
Now the non-personalized
recommendations are shown on the top,
and you can see the top results generally
have much higher prices than the lower
group, and that's because when
reviewers cared more about value, as
dictated by this query, they tended
to really favor low-priced hotels.
So this is yet
another application of this technique.
It shows that by doing text mining
we can understand the users better.
And once we can understand users better,
we can serve these users better.
So to summarize our discussion
of opinion mining in general,
this is a very important topic and
with a lot of applications.
Text sentiment
analysis can be readily done by
using just text categorization.
But standard techniques
tend not to be enough,
and so we need to have an enriched
feature representation.
And we also need to consider
the order of the categories,
and we talked about ordinal
regression for solving this problem.
We have also shown that
generative models are powerful for
mining latent user preferences,
in particular the generative
model for latent rating regression.
We embed some interesting
preference information and
sentiment weights of words in the model,
and as a result we can learn
useful information when
fitting the model to the data.
Now most approaches have been proposed and
evaluated
for product reviews, and that is because
in such a context, the opinion holder and
the opinion target are clear,
and they are easy to analyze.
And they, of course,
also have a lot of practical applications.
Opinion mining from news and
social media is also important, but that's
more difficult than analyzing review data,
mainly because the opinion holders and
opinion targets are often unclear.
So that calls for
natural language processing
techniques to uncover them accurately.
Here are some suggested readings.
The first two are small books that
are overviews of this topic,
where you can find a lot of discussion
about other variations of the problem and
techniques proposed for
solving the problem.
The next two papers are about
generative models for
latent aspect rating analysis.
The first one is about solving
the problem using two stages, and
the second one is about a unified model
where the topic model is integrated
with the regression model to solve
the problem using a unified model.
[MUSIC]

[SOUND] This lecture is about
text-based prediction.
In this lecture, we're going to
start talking about mining
a different kind of knowledge,
as you can see here on this slide.
Namely we're going to use text
data to infer values of some other
variables in the real world that may
not be directly related to the text.
Or only remotely related to text data.
So this is very different
from content analysis or
topic mining where we directly
characterize the content of text.
It's also different from opinion mining or
sentiment analysis,
which still has to do with
characterizing mostly the content,
only that we focus more
on the subjective content,
which reflects what we know
about the opinion holder.
But this only provides a limited
view of what we can predict.
In this lecture and the following
lectures, we're going to talk more about
how we can predict more
information about the world.
How can we get sophisticated patterns
from text together with other kinds of data?
It would be useful first to take a look
at the big picture of prediction and
data mining in general, which
I call the data mining loop.
So the picture that you are seeing right
now is that there are multiple sensors,
including human sensors,
to report what we have seen in
the real world in the form of data.
Of course, the data comes in the form
of both non-text data and text data.
And our goal is to see if we
can predict some values of
important real world
variables that matter to us.
For example, someone's health condition,
or the weather, etc.
And so these variables would be important
because we might want to act on that.
We might want to make
decisions based on that.
So how can we get from the data
to these predicted values?
Well in general we'll first have to do
data mining and analysis of the data.
Because we should, in general, use
all the data that we collected
in such a prediction problem setup,
we are very much interested in
joint mining of non-text and
text data, which should
combine all the data together.
And then, through analysis,
we generally generate
multiple predictors of the
variable that interests us.
And we call these features.
And these features can then be
put into a predictive model,
to actually predict the value
of any interesting variable.
So this then allows us
to change the world.
And so
this basically is the general process for
making a prediction based on data,
including the text data.
Now it's important to emphasize
that a human actually
plays a very important
role in this process.
Especially because of
the involvement of text data.
So humans would first be involved
in the mining of the data.
They would control the generation
of these features.
And they would also help us
understand the text data,
because text data are created
to be consumed by humans.
Humans are the best in consuming or
interpreting text data.
But when there are, of course, a lot of
text data then machines have to help and
that's why we need to do text data mining.
Sometimes machines can see patterns in
a lot of data that humans may not see.
But in general humans would
play an important role in
analyzing text data in any application.
Next, humans also must be involved
in predictive model building and
adjusting or testing.
So in particular, we will have a lot
of domain knowledge about the problem
of prediction that we can build
into this predictive model.
And then next, of course, when we have
predicted values for the variables,
then humans would be involved in
taking actions to change the world or
make decisions based on
these predicted values.
And finally it's interesting
that a human could be involved
in controlling the sensors.
And this is so that we can
adjust the sensors to collect
the most useful data for prediction.
So that's why I call
this the data mining loop.
Because as we perturb the sensors,
they'll collect new and
more useful data, and then we will
obtain more data for prediction.
And this data generally will help
us improve the prediction accuracy.
And in this loop,
humans will recognize what additional
data will need to be collected.
And machines, of course,
help humans identify what data
should be collected next.
In general, we want to collect data
that is most useful for learning.
And there is actually a subarea in
machine learning called active learning
that has to do with this.
How do you identify the data
points that would be most helpful
to a machine learning program
if you could label them?
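One common strategy in active learning is uncertainty sampling,
and here is a minimal sketch of it, under my own assumptions.
I assume a binary classifier that returns the probability of the
positive class, and we pick the unlabeled items whose probability
is closest to 0.5. The function most_uncertain and the toy
probabilities are hypothetical and only for illustration.

```python
def most_uncertain(unlabeled, predict_proba, k=1):
    """Return the k unlabeled items the model is least sure about,
    i.e. whose predicted probability is closest to 0.5."""
    return sorted(unlabeled, key=lambda x: abs(predict_proba(x) - 0.5))[:k]

# Hypothetical model output: probability that a review is positive.
toy_probabilities = {
    "great hotel, loved it": 0.97,
    "terrible service": 0.04,
    "the room was okay I guess": 0.52,
    "not bad but not good": 0.45,
}
to_label = most_uncertain(list(toy_probabilities), toy_probabilities.get, k=2)
print(to_label)
```

The items selected this way are the ones we would ask a human to
label next, which is one way to close the loop between humans,
sensors, and the predictive model.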
So, in general,
you can see there is a loop here from
data acquisition to data analysis.
Or data mining to prediction of values.
And to take actions to change the world,
and then observe what happens.
And then you can then
decide what additional data
have to be collected by
adjusting the sensors.
Or from the prediction errors,
you can also note what additional data
we need to acquire in order to
improve the accuracy of prediction.
And this big picture is
actually very general and
it's reflecting a lot of important
applications of big data.
So, it's useful to keep that in mind
while we are looking at some text
mining techniques.
So from a text mining perspective,
we're interested in text-based prediction.
Of course, sometimes texts
alone can make predictions.
And this is most useful for
prediction about human behavior or
human preferences or opinions.
But in general text data will be
put together with non-text data.
So the interesting questions
here would be, first,
how can we design effective predictors?
And how do we generate such
effective predictors from text?
And this question has been addressed to
some extent in some previous lectures
where we talked about what kind of
features we can design for text data.
And it has also been
addressed to some extent by
talking about the other knowledge
that we can mine from text.
So, for example, topic mining can be very
useful to generate the patterns or topic
based indicators or predictors that can
be further fed into a predictive model.
So topics can be an intermediate
representation of text
that would allow us to
design high-level features or
predictors that are useful for
prediction of some other variable.
Although such features are also generated from
the original text data, they provide a much better
representation of the problem and
serve as more effective predictors.
And similarly, sentiment analysis can
lead to such predictors as well.
So, those other data mining or
text mining algorithms can be
used to generate predictors.
The other question is, how can we jointly
mine text and non-text data together?
Now, this is a question that
we have not addressed yet.
So, in this lecture,
and in the following lectures,
we're going to address this problem.
Because this is where we can generate much
more enriched features for prediction.
And it allows us to reveal a lot of
interesting knowledge about the world.
These patterns that
are generated from text and
non-text data can sometimes
already be useful for prediction by themselves.
But when they are put together
with many other predictors,
they can really help
improve the prediction.
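As a rough illustration of putting such predictors together,
here is a small Python sketch under assumptions of my own.
Each example is described by a topic coverage vector and a
sentiment score, both assumed to come from text mining, plus
one non-text feature. They are simply concatenated and fed into
a tiny linear model trained by stochastic gradient descent.
The feature values, the target values, and the function names
are all hypothetical; any standard predictive model could be
used in place of this one.

```python
def joint_features(topic_coverage, sentiment, non_text):
    """Concatenate text-derived and non-text predictors into one vector."""
    return list(topic_coverage) + [sentiment] + list(non_text)

def predict(weights, bias, x):
    """Linear prediction: bias plus the weighted sum of the features."""
    return bias + sum(w * xi for w, xi in zip(weights, x))

def fit_linear(X, y, lr=0.05, epochs=2000):
    """Fit a linear model with plain stochastic gradient descent
    on the squared error (a deliberately simple stand-in model)."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            err = predict(weights, bias, x) - target
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def mse(weights, bias, X, y):
    """Mean squared error of the model on the given examples."""
    return sum((predict(weights, bias, x) - t) ** 2 for x, t in zip(X, y)) / len(y)

# Hypothetical examples: (topic coverage, sentiment, non-text feature).
X = [
    joint_features([0.9, 0.1], 0.8, [1.0]),
    joint_features([0.2, 0.8], -0.5, [0.0]),
    joint_features([0.8, 0.2], 0.6, [1.0]),
    joint_features([0.1, 0.9], -0.7, [0.0]),
]
y = [1.0, 0.0, 1.0, 0.0]  # hypothetical real-world variable to predict

weights, bias = fit_linear(X, y)
print(mse([0.0] * 4, 0.0, X, y), mse(weights, bias, X, y))
```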
Basically, you can see text-based
prediction can actually serve as a unified
framework to combine many text mining and
analysis techniques.
This includes topic mining and any content
mining techniques, as well as sentiment analysis.
The goal here is mainly to infer
values of real-world variables.
But in order to achieve the goal
we can do some other preparations.
And these are subtasks.
So one subtask could be to mine the content
of text data, like topic mining.
And the other could be to mine
knowledge about the observer,
so sentiment analysis and opinion mining.
And both can help provide predictors for
the prediction problem.
And of course we can also add non-text
data directly to the predictive model, but
non-text data also helps
provide a context for text analysis.
And that further improves the topic
mining and the opinion analysis.
And such improvement often leads to more
effective predictors for our problems.
It would enlarge the space of patterns
of opinions and topics that we can
mine from text,
and we'll discuss that more later.
So the joint analysis of text and
non-text data can be actually
understood from two perspectives.
From one perspective,
non-text data can help with text mining,
because non-text data can
provide a context for
mining text data and provide a way to
partition data in different ways.
And this leads to a number of
techniques for contextual text mining,
that is, mining text in
the context defined by non-text data.
And you see this reference here for
a large body of work in this direction.
And I will highlight some of them
in the next lectures.
Now, the other perspective is text data
can help with non-text
data mining as well.
And this is because text
data can help interpret
patterns discovered from non-text data.
Let's say you discover some frequent
patterns from non-text data.
Now we can use the text data
associated with instances
where the pattern occurs as well as
text data that is associated with
instances where the pattern
doesn't occur.
And this gives us two sets of text data.
And then we can see what's the difference.
And this difference in text data is
interpretable because text content is
easy to digest.
And that difference might
suggest some meaning for
this pattern that we
found from non-text data.
So, it helps interpret such patterns.
And this technique is
called pattern annotation.
And you can see this reference
listed here for more detail.
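Here is a minimal sketch of this comparison in Python, under
assumptions of my own. We take the texts attached to instances
where a pattern occurs and the texts where it does not, and rank
words by a smoothed log ratio of their relative frequencies.
The function distinctive_words and the toy texts are hypothetical,
and this is only one simple way to contrast two sets of text,
not necessarily the method in the referenced work.

```python
import math
from collections import Counter

def distinctive_words(texts_with, texts_without, top_n=3):
    """Rank words by how much more frequent they are in the texts
    where the pattern occurs than in the texts where it does not."""
    with_counts = Counter(w for t in texts_with for w in t.lower().split())
    without_counts = Counter(w for t in texts_without for w in t.lower().split())
    vocab = sorted(set(with_counts) | set(without_counts))
    n_with = sum(with_counts.values())
    n_without = sum(without_counts.values())

    def score(word):
        # Add-one smoothing so unseen words do not cause division by zero.
        p_with = (with_counts[word] + 1) / (n_with + len(vocab))
        p_without = (without_counts[word] + 1) / (n_without + len(vocab))
        return math.log(p_with / p_without)

    return sorted(vocab, key=score, reverse=True)[:top_n]

# Hypothetical texts attached to transactions with and without a pattern.
texts_with = ["holiday discount sale", "big discount today", "discount on shoes"]
texts_without = ["new shoes arrived", "store opens today", "shoes in stock"]
annotation = distinctive_words(texts_with, texts_without, top_n=1)
print(annotation)
```

The top-ranked words then serve as a rough annotation that
suggests a meaning for the pattern.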
So here are the references
that I just mentioned.
The first is a reference for
pattern annotation.
The second is, Qiaozhu Mei's
dissertation on contextual text mining.
It contains a large body of work on
contextual text mining techniques.
[MUSIC]

[SOUND] This lecture is about
contextual text mining.
Contextual text mining
is related to multiple
kinds of knowledge that we mine from
text data, as I'm showing here.
It's related to topic mining because you
can make topics associated with context,
like time or location.
And similarly, we can make opinion
mining more contextualized,
making opinions connected to context.
It's related to text based prediction
because it allows us to combine non-text
data with text data to derive
sophisticated predictors for
the prediction problem.
So more specifically, why are we
interested in contextual text mining?
Well, that's first because text
often has rich context information.
And this can include direct context such
as meta-data, and also indirect context.
So, the direct context can include
the meta-data such as time,
location, authors, and
source of the text data.
And they're almost always available to us.
Indirect context refers to additional
data related to the meta-data.
So for example, from the authors,
we can further obtain additional
context such as the social network of
the author, or the author's age.
Such information is not in general
directly related to the text, yet
through the author, we can connect them.
There could also be other text
data from the same source
as this one, and that other text can
be connected with this text as well.
So in general, any related data
can be regarded as context.
So there could be remotely
related context as well.
And so what's the use?
What is text context used for?
Well, context can be used to partition
text data in many interesting ways.
It can allow us to partition
text data in almost any way we need.
And this is very important
because this allows
us to do interesting comparative analyses.
It also in general,
provides meaning to the discovered topics,
if we associate the text with context.
So here's an illustration of how context
can be regarded as interesting
ways of partitioning text data.
So here I just show some research
papers published in different years
and at different venues,
with different conference names listed at
the bottom, like SIGIR or ACL, etc.
Now such text data can be partitioned
in many interesting ways
because we have context.
So the context here just includes time and
the conference venues.
But perhaps we can include
some other variables as well.
But let's see how we can partition
this data in interesting ways.
First, we can treat each
paper as a separate unit.
So in this case, the paper ID is the context,
and each paper has its own context.
It's independent.
But we can also treat all the papers
within 1998 as one group and
this is only possible because
of the availability of time.
And we can partition data in this way.
This would allow us to compare topics for
example, in different years.
Similarly, we can partition
the data based on the venues.
We can get all the SIGIR papers and
compare those papers with the rest.
Or compare SIGIR papers with KDD papers,
with ACL papers.
We can also partition the data to obtain
the papers written by authors in the U.S.,
and that of course,
uses additional context of the authors.
And this would allow us to then
compare such a subset with
another set of papers written
by authors in other countries.
Or we can obtain a set of
papers about text mining, and
this can be compared with
papers about another topic.
And note that these
partitionings can be also
intersected with each other to generate
even more complicated partitions.
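To illustrate these partitionings, here is a small Python sketch
with made-up paper records. The field names id, year, and venue,
and the function partition, are assumptions for illustration only.
Each context variable is just a different key for grouping the same
papers, and intersecting two partitionings amounts to grouping by
a pair of keys.

```python
from collections import defaultdict

def partition(papers, key):
    """Group paper IDs by the value of a context key function."""
    groups = defaultdict(list)
    for paper in papers:
        groups[key(paper)].append(paper["id"])
    return dict(groups)

# Hypothetical paper records with two context variables.
papers = [
    {"id": "p1", "year": 1998, "venue": "SIGIR"},
    {"id": "p2", "year": 1998, "venue": "ACL"},
    {"id": "p3", "year": 1999, "venue": "SIGIR"},
    {"id": "p4", "year": 1999, "venue": "KDD"},
]

by_year = partition(papers, lambda p: p["year"])
by_venue = partition(papers, lambda p: p["venue"])
# Intersecting the two partitionings: group by (year, venue).
by_year_and_venue = partition(papers, lambda p: (p["year"], p["venue"]))
print(by_year)
print(by_venue)
print(by_year_and_venue)
```

Once the papers are grouped this way, topics can be mined
within each group and then compared across groups.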
And so in general, this enables
discovery of knowledge associated with
different context as needed.
And in particular,
we can compare different contexts.
And this often gives us
a lot of useful knowledge.
For example, comparing topics over time,
we can see trends of topics.
Comparing topics in different
contexts can also reveal differences
about the two contexts.
So there are many interesting questions
that require contextual text mining.
Here I list some very specific ones.
For example, what topics have
been getting increasing attention
recently in data mining research?
Now to answer this question,
obviously we need to analyze
text in the context of time.
So time is context in this case.
Is there any difference in the responses
of people in different regions
to an event, to any event?
So this is a very broad
question.
In this case of course,
location is the context.
What are the common research
interests of two researchers?
In this case, authors can be the context.
Is there any difference in the research
topics published by authors in the USA and
those outside?
Now in this case,
the context would include the authors and
their affiliation and location.
So this goes beyond just
the author himself or herself.
We need to look at the additional
information connected to the author.
Is there any difference in the opinions
about a topic expressed on
one social network and another?
In this case, the social network of
authors and the topic can be the context.
Are there topics in news data that
are correlated with sudden changes in
stock prices?
In this case, we can use a time series
such as stock prices as context.
What issues mattered in the 2012
presidential campaign, or
presidential election?
Now in this case,
time serves again as context.
So, as you can see,
the list can go on and on.
Basically, contextual text mining
can have many applications.
[MUSIC]

[MUSIC]
This lecture is about
a specific technique for
Contextual Text Mining called Contextual
Probabilistic Latent Semantic Analysis.
In this lecture, we're going to continue
discussing Contextual Text Mining.
And we're going to introduce Contextual
Probabilistic Latent Semantic Analysis
as an extension of PLSA for
doing contextual text mining.
Recall that in contextual text mining
we hope to analyze topics in text
in consideration of the context, so
that we can associate the topics with
properties of the context that we are interested in.
So in this approach, contextual
probabilistic latent semantic analysis,
or CPLSA, the main idea is to
explicitly add interesting
context variables into a generative model.
Recall that before, when we generated
the text, we generally assumed we would start
with some topics, and
then sample words from those topics.
But here, we're going to add context
variables, so that the coverage of topics,
and also the content of topics,
would be tied to context.
Or in other words, we're going to let
the context influence both the coverage and
the content of a topic.
The consequence is that this will enable
us to discover contextualized topics.
Make the topics more interesting,
more meaningful.
Because we can then have topics
that can be interpreted as
specific to a particular
context that we are interested in,
for example, a particular time period.
As an extension of the PLSA model,
CPLSA makes the following changes.
Firstly it would model the conditional
likelihood of text given context.
That clearly suggests that the generation
of text would then depend on context,
and that allows us to bring
context into the generative model.
Secondly, it makes two specific
assumptions about the dependency
of topics on context.
One is to assume that depending on
the context, depending on different time
periods or different locations, we assume
that there are different views of a topic
or different versions of word
descriptions that characterize a topic.
And this assumption allows
us to discover different
variations of the same topic
in different contexts.
The other is that we assume the topic
coverage also depends on the context.
That means depending on the time or
location, we might cover
topics differently.
Again, this dependency
would then allow us to
capture the association of
topics with specific contexts.
We can still use the EM algorithm to solve
the problem of parameter estimation.
And in this case, the estimated parameters
would naturally contain context variables.
And in particular,
a lot of conditional probabilities
of topics given certain context.
And this is what allows you
to do contextual text mining.
So this is the basic idea.
Now, we don't have time to
introduce this model in detail,
but there are references here that you
can look into to know more detail.
Here I just want to explain the high
level ideas in more detail.
Particularly I want to explain
the generation process
of text data that has context
associated in such a model.
So as you see here, we can assume
there are still multiple topics.
For example, some topics might represent
themes like government response,
donation, or the city of New Orleans.
Now this example is in the context
of Hurricane Katrina,
which hit New Orleans.
Now as you can see we
assume there are different
views associated with each of the topics.
And these are shown as View 1,
View 2, View 3.
Each view is a different
version of word distributions.
And these views are tied
to some context variables.
For example, tied to the location Texas,
or the time July 2005,
or the occupation of the author
being a sociologist.
Now, on the right side, we assume
the document has context information.
So the time is known to be July 2005.
The location is Texas, etc.
And such context information is
what we hope to model as well.
So we're not going to just model the text.
And so one idea here is to model
the variations of topic content in
various contexts.
And this gives us different views
of the word distributions.
Now on the bottom you will see that the theme
coverage, or topic coverage, might also vary
according to these contexts,
because in the case
of a location like Texas, people might
want to cover the red topic more.
That's New Orleans.
That's visualized here.
But in a certain time period,
maybe a particular topic
will be covered more.
So this variation is
also considered in CPLSA.
So to generate such a document with
context, we first choose a view.
And this view of course now could
be from any of these contexts.
Let's say we have taken the
view that depends on the time,
in the middle.
So now, we will have a specific
version of word distributions.
Now, you can see some probabilities
of words for each topic.
Now, once we have chosen a view,
the situation will be very similar
to what happened in standard PLSA.
We assume we have got word distribution
associated with each topic, right?
And then next, we will also choose
a coverage from the bottom, so
we're going to choose a particular
coverage. That coverage,
before, was fixed in PLSA and
tied to a particular document.
Each document has just one
coverage distribution.
Now here, because we consider context,
the distribution of topics, or the coverage
of topics, can vary depending on the
context that has influenced the coverage.
So, for example,
we might pick a particular coverage.
Let's say in this case we picked
a document specific coverage.
Now with the coverage and
these word distributions
we can generate a document in
exactly the same way as in PLSA.
So what it means, we're going to
use the coverage to choose a topic,
to choose one of these three topics.
Let's say we have picked the yellow topic.
Then we'll draw a word from this
particular topic on the top.
Okay, so
we might get a word like government.
And then next time we might
choose a different topic, and
we'll get donate, etc.
Until we generate all the words.
And this is basically
the same process as in PLSA.
So the main difference is that
when we obtain the coverage
and the word distributions,
we let the context influence our choice.
So in other words, we have extra switches
that are tied to these contexts that will
control the choices of different views
of topics and the choices of coverage.
And naturally the model will have
more parameters to estimate.
But once we can estimate those
parameters that involve the context,
then we will be able to understand
the context specific views of topics,
or context specific coverages of topics.
And this is precisely what we
want in contextual text mining.
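The generation process just described can be sketched in Python
as follows. This is a simplified illustration under my own
assumptions, not the exact CPLSA model: the view and the coverage
are each chosen once per document, as in the walkthrough above,
and all topic names, words, and probabilities are made up.
The names sample and generate_document are hypothetical.

```python
import random

def sample(dist, rng):
    """Draw one key from a {key: probability} dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]

def generate_document(views, coverages, view_choice, coverage_choice,
                      length, rng):
    """Generate words following the process in the lecture:
    choose a view, choose a coverage, then repeatedly choose
    a topic from the coverage and a word from that topic."""
    view = views[sample(view_choice, rng)]              # context-specific word distributions
    coverage = coverages[sample(coverage_choice, rng)]  # context-specific topic coverage
    words = []
    for _ in range(length):
        topic = sample(coverage, rng)
        words.append(sample(view[topic], rng))
    return words

# Hypothetical views: each context has its own version of every topic.
views = {
    "time=July 2005": {
        "government": {"government": 0.6, "response": 0.4},
        "donation": {"donate": 0.7, "relief": 0.3},
    },
    "location=Texas": {
        "government": {"government": 0.5, "shelter": 0.5},
        "donation": {"donate": 0.5, "help": 0.5},
    },
}
# Hypothetical coverages: topic proportions tied to a context or a document.
coverages = {
    "location=Texas": {"government": 0.3, "donation": 0.7},
    "document": {"government": 0.8, "donation": 0.2},
}
view_choice = {"time=July 2005": 0.5, "location=Texas": 0.5}
coverage_choice = {"location=Texas": 0.4, "document": 0.6}

rng = random.Random(0)
doc = generate_document(views, coverages, view_choice, coverage_choice, 8, rng)
print(doc)
```

The two choice distributions play the role of the extra switches:
they are the additional parameters that tie views and coverages
to context.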
So here are some sample results
from using such a model,
not necessarily exactly the same model,
but similar models.
So on this slide you see
some sample results of
comparing news articles about Iraq War and
Afghanistan War.
Now we have about 30 articles on the Iraq
War and 26 articles on the Afghanistan War.
And in this case,
the goal is to reveal the common topics
covered in both sets of articles and
the differences or variations of
the topic in each of the two collections.
So in this case the context is explicitly
specified by the topic or collection.
And we see the results here
show that there is a common
theme that's corresponding to
Cluster 1 here in this column.
And there is a common theme indicating that
the United Nations is involved in both wars.
It's a common topic covered
in both sets of articles.
And that's indicated by the high
probability words shown here, united and
nations.
Now if you know the background,
of course this is not surprising and
this topic is indeed very
relevant to both wars.
If you look at the column further,
what's interesting is that the next
two cells of word
distributions actually tell us
collection-specific variations
of the topic of United Nations.
So it indicates that in the Iraq War,
the United Nations was more involved
in weapons inspections, whereas in
the Afghanistan War it was more involved
in maybe aid to the Northern Alliance.
It's a different variation of
the topic of United Nations.
So this shows that by
bringing in the context,
in this case the different wars or
the different collections of texts,
we can have topical variations
tied to these contexts,
to reveal the differences in coverage
of the United Nations in the two wars.
Now similarly, if you look at
the second cluster, Cluster 2,
it has to do with the killing of people,
and, again,
it's not surprising if you know
the background about wars.
All the wars involve killing of people,
but
imagine if you are not familiar
with the text collections.
We have a lot of text articles, and
such a technique can reveal the common
topics covered in both sets of articles.
It can be used to reveal common topics
in multiple sets of articles as well.
Of course, if you look down
that column of Cluster 2,
you see variations of killing of people,
and those correspond to different contexts.
And here is another example of results
obtained from blog articles
about Hurricane Katrina.
In this case,
what you see here is visualization of
the trends of topics over time.
And the top one shows just
the temporal trends of two topics.
One is oil price, and one is about
the flooding of the city of New Orleans.
Now these topics are obtained from
blog articles about Hurricane Katrina.
And people talk about these topics
in addition to some other topics.
But the visualization shows
that with this technique,
we can have a conditional
distribution of time
given a topic.
So this allows us to plot
this conditional probability, and
the curves look like what you're seeing here.
We see that, initially, the two
curves tracked each other very well.
But later we see the topic of New Orleans
was mentioned again but oil price was not.
And this turns out to be
the time period when another hurricane,
hurricane Rita hit the region.
And that apparently triggered more
discussion about the flooding of the city.
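One way such a conditional distribution of time given a topic
could be obtained is by Bayes' rule, assuming the model gives
us the coverage of each topic within each time period. Here is
a minimal Python sketch under that assumption. The weekly
coverage values, the prior over weeks, and the function name
time_given_topic are all made up for illustration, and the
lecture does not say exactly how the plotted curves were computed.

```python
def time_given_topic(coverage_by_time, time_prior, topic):
    """Compute p(time | topic) from p(topic | time) and p(time)
    using Bayes' rule."""
    joint = {t: coverage_by_time[t][topic] * time_prior[t]
             for t in coverage_by_time}
    total = sum(joint.values())
    return {t: value / total for t, value in joint.items()}

# Hypothetical p(topic | week) estimated by a context-aware topic model.
coverage_by_time = {
    "week1": {"oil price": 0.4, "flooding": 0.4},
    "week2": {"oil price": 0.2, "flooding": 0.2},
    "week3": {"oil price": 0.1, "flooding": 0.5},
}
# Hypothetical p(week), e.g. the share of blog articles in each week.
time_prior = {"week1": 0.5, "week2": 0.25, "week3": 0.25}

oil_curve = time_given_topic(coverage_by_time, time_prior, "oil price")
flood_curve = time_given_topic(coverage_by_time, time_prior, "flooding")
print(oil_curve)
print(flood_curve)
```

In this toy example the two curves move together in the first
two weeks, and then only the flooding topic picks up again in
the third week, which mimics the kind of pattern described above.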
The bottom curve shows
the coverage of this topic
about flooding of the city by blog
articles in different locations.
And it also shows some shift of
coverage that might be related to
people migrating from the state
of Louisiana to Texas, for example.
So in this case we can see that time can
be used as context to reveal trends of
topics.
These are some additional
results on spatial patterns.
In this case it was about
the topic of government response.
And there was some criticism about
the slow response of government
in the case of Hurricane Katrina.
And the discussion now is
covered in different locations.
And these visualizations show the coverage
in different weeks of the event.
And initially it's covered
mostly in the victim states,
in the South, but then gradually
spread into other locations.
But in week four,
which is shown on the bottom left,
we see a pattern that's very similar
to the first week on the top left.
And that's when again
Hurricane Rita hit in the region.
So such a technique would allow
us to use location as context
to examine variations of topics.
And of course the model
is completely general, so
you can apply this to any
other collections of text
to reveal spatiotemporal patterns.
Here is yet another application
of this kind of model,
where we look at the use of the model for
event impact analysis.
So here we're looking at research
articles in information retrieval,
IR, particularly SIGIR papers.
And the topic we are focusing on
is about the retrieval models.
And you can see the top words with high
probability about this model on the left.
And then we hope to examine
the impact of two events.
One is the start of TREC, the
Text REtrieval Conference.
This is a major evaluation effort
sponsored by the U.S.
government, and it was launched in 1992 or
around that time.
And that is known to have made an impact on
the topics of research
in information retrieval.
The other is the publication of
a seminal paper by Ponte and Croft.
This is about a language model
approach to information retrieval.
It's also known to have made a high
impact on information retrieval research.
So we hope to use this kind of
model to understand impact.
The idea here is simply to
use the time as context,
and use these events to divide
the time into a period before
the event and
another after the event.
And then we can compare
the differences of the topics,
the variations, etc.
So in this case,
the results show that before TREC the study of
retrieval models was mostly about the vector
space model, the Boolean model, etc.
But after TREC,
apparently the study of retrieval models
has involved a lot of other words
that seem to suggest some
different retrieval tasks. So for
example, email was used in
the enterprise search tasks and
subtopic retrieval was another
task later introduced by TREC.
On the bottom,
we see the variations that are correlated
with the publication of
the language model paper.
Before, we have those classic
probabilistic models,
logic models, Boolean models, etc., but after 1998,
we see clear dominance of language
models as probabilistic models.
And we see words like language model,
estimation of parameters, etc.
So this technique here can use events as
context to understand the impact of events.
Again the technique is general, so
you can use this to analyze
the impact of any event.
Here are some suggested readings.
The first is a paper about a simple extension of
PLSA to enable cross-collection comparison.
It's to perform comparative
text mining to allow us to
extract common topics shared
by multiple collections
and their variations
in each collection.
The second one is the main
paper about the CPLSA model,
with a discussion of a lot of applications.
The third one has a lot of details
about the spatiotemporal patterns for
the Hurricane Katrina example.
[MUSIC]

[SOUND] This lecture is about
how to mine text data with
social network as context.
In this lecture we're going to continue
discussing contextual text mining.
In particular, we're going to look at
the social network of authors as context.
So first, what's our motivation for using
network context for analysis of text?
The context of a text
article can form a network.
For example, the authors
of research articles
might form collaboration networks.
And authors of social media content
might form social networks.
For example,
on Twitter people might follow each other.
Or on Facebook people might
claim others as friends, etc.
So such context connects
the content of the authors.
Similarly, locations associated with
text can also be connected to form
a geographical network.
But in general you can imagine
the metadata of the text data
can form some kind of network
if they have some relations.
Now there is some benefit in
jointly analyzing text and
its social network context or
network context in general.
And that's because we can use network to
impose some constraints on topics of text.
So for example it's reasonable
to assume that authors
connected in collaboration networks
tend to write about similar topics.
So such heuristics can be used
to guide us in analyzing topics.
Text also can help characterize the
content associated with each subnetwork.
And this is to say that both
kinds of data, the network and
text, can help each other.
So for example the difference in
opinions expressed in
two subnetworks can be revealed by
doing this type of joint analysis.
So here we briefly introduce a model
called the network-supervised topic model.
In this slide we're going to
give some general ideas.
And then in the next slide we're
going to give some more details.
But in general in this part of the course
we don't have enough time to cover
these frontier topics in detail.
But we provide references
that would allow you to
read more about the topic
to know the details.
But it should still be useful
to know the general ideas.
And to know what they can do, so that you know
when you might be able to use them.
So the general idea of the network
supervised topic model is the following.
Let's start with viewing
regular topic models,
like PLSA and LDA, as
solving an optimization problem.
Of course, in this case,
the optimization objective
function is a likelihood function.
So we often use maximum likelihood
estimator to obtain the parameters.
And these parameters will give us
useful information that we want to
obtain from text data.
For example, topics.
So we want to maximize the probability
of the text data given the parameters,
generally denoted by lambda.
The main idea of incorporating network is
to think about the constraints that
can be imposed based on the network.
In general,
the idea is to use the network to
impose some constraints on
the model parameters, lambda here.
For example,
the text at adjacent nodes of the network
can be assumed to cover similar topics.
Indeed, in many cases,
they tend to cover similar topics.
So we may be able to smooth
the topic distributions
on the graph on the network so
that adjacent nodes will have
very similar topic distributions.
So they will share a common
distribution on the topics.
Or have just slight variations of the
topic distributions, of the coverage.
So, technically, what we can do
is simply add network-induced
regularizers to the likelihood
objective function, as shown here.
So instead of just optimizing
the probability of the text
data given parameters lambda, we're
going to optimize another function F.
This function combines the likelihood with
a regularizer function called R here.
And the regularizer is defined on
the parameters lambda and the network.
It tells us basically
what kind of parameters are preferred
from a network constraint perspective.
So you can easily see this is in effect
implementing the idea of imposing
some prior on the model parameters.
Only that we're not necessarily
having a probabilistic model, but
the idea is the same.
We're going to combine the two in
one single objective function.
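As a rough sketch of the combined objective just described, it could be written as below. The notation is an assumption made here for illustration: Lambda for the model parameters, R for the regularizer, F for the combining function, and the labels TextData and Network for the two kinds of input. The slide may use different symbols.

```latex
% Requires amsmath for \text.
\[
\Lambda^{*} \;=\; \arg\max_{\Lambda}\;
F\bigl(\, p(\text{TextData} \mid \Lambda),\; R(\Lambda, \text{Network}) \,\bigr)
\]
```

Here the first argument of F is the likelihood of the text under the topic model, and the second scores how well the parameters agree with the network.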
So, the advantage of this idea
is that it's quite general.
Here the topic model can be any
generative model for text.
It doesn't have to be PLSA or
LDA, or the current topic models.
And similarly,
the network can also be any network,
any graph that connects
these text objects.
This regularizer can
also be any regularizer.
We can be flexible in capturing different
heuristics that we want to capture.
And finally,
the function F can also vary, so
there can be many different
ways to combine them.
So, this general idea is actually quite
powerful.
It offers a general approach
to combining these different
types of data in single
optimization framework.
And this general idea can really
be applied for any problem.
In the paper referenced here,
a particular instantiation
called NetPLSA was studied.
In this case, it's just an
instantiation of PLSA to incorporate this
simple constraint imposed by the network.
And the prior here is that neighbors on
the network must have
similar topic distributions.
They must cover similar
topics in similar ways.
And that's basically
what it says in English.
So technically we just have
a modified objective function here.
It is defined on both the text collection C
and the network graph G here.
And if you look at this formula,
you can actually recognize
some parts,
because they should be
fairly familiar to you by now.
So can you recognize which
part is the likelihood for
the text given the topic model?
Well if you look at it, you will see this
part is precisely the PLSA log-likelihood
that we want to maximize when we
estimate parameters for PLSA alone.
But the second part shows some
additional constraints on the parameters.
In particular,
we see here that it measures
the difference between the topic
coverage at node u and node v,
the two adjacent nodes on the network.
We want their distributions to be similar.
So here we are computing the square
of their differences and
we want to minimize this difference.
And note that there's a negative sign in
front of this sum, this whole sum here.
So this makes it possible to find
the parameters that both
maximize the PLSA log-likelihood,
which means the parameters
will fit the data well,
and also respect the
constraint from the network.
And this is the negative
sign that I just mentioned.
Because this is a negative sign,
when we maximize this
objective function we'll actually
minimize this second term here.
So if we look further in
this picture, we'll see
there is also a weight of the
edge between u and v here,
and that comes from our network.
If you have a weight that says, well,
these two nodes are strong
collaborators as researchers,
or there is a strong connection
between two people in a social network,
then they would have a high weight.
Then that means it would be more important
that their topic coverages are similar.
And that's basically what it says here.
And finally you see
a parameter lambda here.
This is a new parameter to control
the influence of network constraint.
We can see easily, if lambda is set to 0,
we just go back to the standard PLSA.
But when lambda is set to a larger value,
then we will let the network
influence the estimated models more.
So as you can see, the effect here is
that we're going to do basically PLSA.
But we're going to also try
to make the topic coverages
on the two nodes that are strongly
connected to be similar.
And we ensure their coverages are similar.
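The pieces of this objective can be sketched in a few lines of code. This is only an illustration under stated assumptions, not the implementation from the paper: the function names, the toy numbers, and the exact weighting (1 - lambda on the log-likelihood and lambda / 2 on the network term) are assumptions, chosen so that lambda = 0 gives back the plain PLSA log-likelihood as described above.

```python
import math

def plsa_log_likelihood(counts, doc_topic, topic_word):
    """Sum over documents d and words w of c(w, d) * log sum_j p(theta_j | d) p(w | theta_j)."""
    total = 0.0
    for d, word_counts in counts.items():
        for w, c in word_counts.items():
            p_w = sum(p_t * topic_word[j][w] for j, p_t in enumerate(doc_topic[d]))
            total += c * math.log(p_w)
    return total

def network_penalty(doc_topic, edges):
    """Sum over edges (u, v, weight) of weight * squared difference in topic coverage."""
    penalty = 0.0
    for u, v, weight in edges:
        diff = sum((a - b) ** 2 for a, b in zip(doc_topic[u], doc_topic[v]))
        penalty += weight * diff
    return penalty

def regularized_objective(counts, doc_topic, topic_word, edges, lam):
    """Quantity to maximize: (1 - lam) * log-likelihood - (lam / 2) * network penalty (assumed form)."""
    log_likelihood = plsa_log_likelihood(counts, doc_topic, topic_word)
    return (1.0 - lam) * log_likelihood - (lam / 2.0) * network_penalty(doc_topic, edges)

# Hypothetical toy data: two connected nodes "a" and "b", two topics, two words.
counts = {"a": {"retrieval": 2, "mining": 1}, "b": {"retrieval": 1, "mining": 2}}
doc_topic = {"a": [0.9, 0.1], "b": [0.2, 0.8]}          # p(theta_j | d)
topic_word = [{"retrieval": 0.8, "mining": 0.2},        # p(w | theta_0)
              {"retrieval": 0.3, "mining": 0.7}]        # p(w | theta_1)
edges = [("a", "b", 1.0)]                               # (u, v, edge weight)

objective = regularized_objective(counts, doc_topic, topic_word, edges, lam=0.5)
print(objective)
```

This only evaluates the objective for fixed parameters. Estimating the parameters that maximize it needs an optimization procedure, which is described in the NetPLSA paper.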
So here are some sample results
from that paper.
This slide shows the
results of using PLSA.
And the data here is DBLP data,
bibliographic data
about research articles.
And the experiments have to do with
using four communities of publications:
IR for information retrieval,
DM for data mining,
ML for machine learning, and Web.
There are four communities of articles,
and we were hoping
to see that the topic mining can help
us uncover these four communities.
But from these sample topics that you
see here, which are generated by PLSA,
you can see that PLSA is unable to generate
the four communities that
correspond to our intuition.
The reason was because they
are all mixed together and
there are many words that
are shared by these communities.
So it's not that easy to use
four topics to separate them.
If we use more topics,
perhaps we will have more coherent topics.
But what's interesting is what happens if we
use NetPLSA, where the network,
the collaboration network of authors
in this case, is used to impose constraints.
In this case we also use four topics,
but NetPLSA gives
much more meaningful topics.
So here we'll see that these topics
correspond well to the four communities.
The first is information retrieval.
The second is data mining.
Third is machine learning.
And the fourth is web.
So that separation was mostly
because of the influence of the network,
where we leverage the
collaboration network information.
Essentially the people that
form a collaborating network
would then be kind of assumed
to write about similar topics.
And that's why we're going to
have more coherent topics.
And if you just use text data
alone, based on co-occurrences,
you won't get such coherent topics.
Even though a topic model, like PLSA or
LDA also should be able to
pick up co-occurring words.
So in general the topics
that they generate represent
words that co-occur with each other.
But still they cannot generate such
coherent results as NetPLSA,
showing that the network
context is very useful here.
Now a similar model could also be
useful to characterize the content
associated with each
subnetwork of collaborations.
So a more general view of text
mining in context of network is you
treat text as living in a rich
information network environment.
And that means we can connect all the
related data together as a big network.
And text data can be associated with
a lot of structures in the network.
For example, text data can be associated
with the nodes of the network, and
that's basically what we just
discussed in the NetPLSA.
But text data can be associated with edges
as well, or paths or even subnetworks.
And such a way to represent text
in the big environment of
all the context information
is very powerful,
because it allows us to analyze all the data,
all the information together.
And so in general, analysis of text
should be using the entire network
information that's
related to the text data.
So here's one suggested reading.
And this is the paper about NetPLSA where
you can find more details about the model
and how to estimate such a model.
[MUSIC]

[SOUND]
Hello.
Welcome to the course Text Mining and
Analytics.
My name is ChengXiang Zhai.
I have a nickname, Cheng.
I am a professor of the Department of
Computer Science at the University of
Illinois at Urbana-Champaign.
This course is a part of
a data mining specialization
offered by the University of
Illinois at Urbana-Champaign.
In addition to this course,
there are four other courses offered by
Professor Jiawei Han,
Professor John Hart and me, followed by
a capstone project course that
all of us will teach together.
This course is particularly related to
another course in the specialization,
namely, Text Retrieval and Search Engines,
in that both courses are about text data.
In contrast, pattern discovery and
cluster analysis are about
algorithms more applicable to
all kinds of data in general.
The visualization course is also
relatively general in that the techniques
can be applied to all kinds of data.
This course addresses a pressing need for
harnessing big text data.
Text data has been growing
dramatically recently,
mostly because of the advance of
technologies deployed on the web
that would enable people to
quickly generate text data.
So, I listed some of
the examples on this slide
that can show a variety of text
data that are available today.
For example, if you think about
the data on the internet, on the web,
every day we are seeing many
web pages being created.
Blogs are another kind
of new text data that
are being generated quickly by people.
Anyone can write a blog
article on the web.
News articles of course have always been
a main kind of text data that
is being generated every day.
Emails are yet another kind of text data.
And literature also represents
a large portion of text data.
It's also especially important
because of the high quality
of the data.
That is,
we encode our knowledge about the world
using text data represented by
all the literature articles.
There is a vast amount of knowledge in
all the text
data in these literature articles.
Twitter is another representative
kind of text data, representing social media.
Of course there are forums as well.
People are generating tweets very quickly.
Indeed, as we are speaking, perhaps many
people have already written many tweets.
So, as you can see there
are all kinds of text data
that are being generated very quickly.
Now these text data present
some challenges for people.
It's very hard for anyone to
digest all the text data quickly.
In particular, it's impossible for
scientists to read all of the literature, for
example, or for
anyone to read all the tweets.
So there's a need for tools to help
people digest text data more efficiently.
There is also another
interesting opportunity
provided by such big text data, and
that is it's possible to leverage
the amount of text data to
discover interesting patterns to
turn text data into actionable knowledge
that can be useful for decision making.
So for example,
product managers may be interested
in knowing the feedback of
customers about their products,
knowing how well their
products are being received as
compared with the products of competitors.
This can be a good opportunity for
leveraging text data as we have seen
a lot of reviews of product on the web.
So if we can develop and master text
mining techniques to tap into such
a [INAUDIBLE] to extract the knowledge and
opinions of people about these products,
then we can help these product managers
to gain business intelligence or
to essentially get feedback
from their customers.
In scientific research, for example,
scientists are interested in knowing
the trends of research topics, knowing
about what related fields have discovered.
This problem is especially important
in biology research as well.
Different communities tend to
use different terminologies, yet
they're studying very similar problems.
So how can we integrate the knowledge
that is covered in different communities
to help study a particular problem?
It's very important, and
it can speed up scientific discovery.
So there are many such examples
where we can leverage the text data
to discover usable knowledge
to optimize our decision.
The main techniques for
harnessing big text data are text
retrieval and text mining.
So these are two very much
related technologies. Yet,
they have somewhat different purposes.
These two kinds of techniques are covered
in two courses in this specialization.
So, Text Retrieval and Search
Engines covers text retrieval,
and this is necessary to
turn big text data into
a much smaller but more relevant text
data, which are often the data that
we need to handle a particular problem or
to optimize a particular decision.
This course covers text mining which
is a second step in this pipeline
that can be used to further process
the small amount of relevant data
to extract the knowledge or to help
people digest the text data easily.
So the two courses are clearly related,
in fact,
some of the techniques are shared by
both text retrieval and text mining.
If you have already taken the text
retrieval course, then you might see
some of the content being repeated
in this text mining course, although
we'll be talking about the techniques
from a very different perspective.
If you have not taken
the text retrieval course,
it's also fine because this
course is self-contained and
you can certainly understand all of
the materials without a problem.
Of course, you might find it
beneficial to take both courses and
that will give you a very complete set
of skills to handle big text data.
[MUSIC]

[SOUND]
This
lecture is a brief
introduction to the course.
We're going to cover the objectives
of the course, the prerequisites and
course formats, reference books and
how to complete the course.
The objectives of the course
are the following.
First, we would like to
cover the basic concepts and
practical techniques of text data mining.
So this means we will not be able to
cover some advanced techniques in detail,
but rather we choose
the practically useful
techniques and then treat them in more depth.
We're going to also cover the basic
concepts that are very useful for
many applications.
The second objective is to cover
more general techniques for
text data mining, so
we emphasize the coverage of general
techniques that can be applicable to
any text in any natural language.
We also hope that these
techniques either
automatically work on problems
without any human effort or
only require minimum human effort.
So these criteria have
helped us to choose
techniques that can be
applied to many applications.
This is in contrast to some more
detailed analysis of text data,
particularly using natural
language processing techniques.
Now such techniques
are also very important.
And they are indeed necessary for
some of the applications
where we would like to go in depth to
understand text data in more detail.
Such detailed understanding techniques,
however,
are generally not scalable and they
tend to require a lot of human effort.
So they cannot be easily
applied to any domain.
So as you can imagine in practice,
it would be beneficial to combine
both kinds of techniques using
the general techniques that we'll be
covering in this course as a basis and
improve these techniques by using more
human effort whenever it's appropriate.
We also would like to provide a hands-on
experience to you in multiple aspects.
First, you'll do some experiments
using a text mining toolkit and
implementing text mining algorithms.
Second, you will have the opportunity to
experiment with some algorithms for
text mining and
analytics, to try them on some datasets and
to understand how to do experiments.
And finally, you will have the opportunity
to participate in a competition
on a text-based prediction task.
You're expected to know the basic
concepts of computer science.
For example, the data structures and
some other really basic
concepts in computer science.
You are also expected to be
familiar with programming and
comfortable with programming,
particularly with C++.
This course,
however, is not about programming.
So you are not expected to
do a lot of coding, but
we're going to give you a C++ toolkit
that's fairly sophisticated.
So you have to be comfortable
with handling such a toolkit and
you may be asked to write
a small amount of code.
It's also useful if you
know some concepts and
techniques in probability and
statistics, but it's not necessary.
Having such knowledge would help you
understand some of the algorithms in
more depth.
The format of the course is lectures
plus quizzes that will be given to you
on a regular basis, and there are
also optional programming assignments.
Now, we've made programming
assignments optional,
not because they are not important, but
because we suspect that not
all of you will have the needed
computing resources to do
the programming assignments.
So naturally,
we would encourage all of you to try to do
the programming assignments,
if possible as that will be a great way
to learn about the knowledge
that we teach in this course.
There's no required reading for
this course,
but I have listed some of
the useful reference books here.
So we expect you to be able to understand
all the essential materials by just
watching the lecture videos, and
you should be able to answer all the quiz
questions by just watching the videos.
But it's always good to read additional
books to enlarge the scope of knowledge,
so here are the four books.
The first is a textbook about
statistical language processing.
Some of the chapters [INAUDIBLE]
are especially relevant to this course.
The second one is a textbook
about information retrieval,
but it has broadly covered
a number of techniques that
are really in the category
of text mining techniques.
So it's also useful, because of that.
The third book is actually
a collection of survey articles, and
it has broadly covered all
the aspects of mining text data.
The most relevant chapters
are also listed here.
In these chapters, you can find
some in depth discussion of cutting
edge research on the topics that
we discussed in this course.
And the last one is actually
a book that Sean Massung and
I are currently writing and
we're going to make the rough
draft chapters available at
this URL listed right here.
You can also find additional
reference books and
other readings at the URL
listed at the bottom.
So finally, some information about how
to complete the course. This
information is also on the web,
so I will just briefly go over it.
You can complete the course by
earning one of the following badges.
One is Course Achievement Badge.
To earn that,
you have to have at least a 70%
average score on all the quizzes combined.
It does not mean every quiz has to be 70% or
better.
The second badge here,
this is a Course Mastery Badge and
this just requires a higher score,
90% average score for the quizzes.
There are also three
optional programming badges.
I said earlier that we encourage you
to do programming assignments, but
they're not necessary,
they're not required.
The first is the
Programming Achievement Badge.
This is similar to the Course
Achievement Badge.
It requires you to get at least a 70%
average score on programming assignments.
And similarly, the mastery badge
is given to those who can score
90% average score or better.
The last badge is
a Text Mining Competition Leader Badge and
this is given to those of you who
do well in the competition task.
And specifically, we're planning to give
the badge to the top
30% in the leaderboard.
[MUSIC]

[SOUND]
In
this lecture we give an overview
of Text Mining and Analytics.
First, let's define the term text mining,
and the term text analytics.
The title of this course is
called Text Mining and Analytics.
But the two terms text mining, and text
analytics are actually roughly the same.
So we are not going to
really distinguish them, and
we're going to use them interchangeably.
But the reason that we have chosen to use
both terms in the title is because
there is also some subtle difference,
if you look at the two phrases literally.
Mining puts more emphasis on the process,
so it gives us an algorithmic
view of the problem.
Analytics, on the other hand,
puts more emphasis on the result,
or having a problem in mind.
We are going to look at text
data to help us solve a problem.
But again as I said, we can treat
these two terms roughly the same.
And I think in the literature
you probably will find the same.
So we're not going to really
distinguish that in the course.
Both text mining and
text analytics mean that we
want to turn text data into high quality
information, or actionable knowledge.
So in both cases, we
have the problem of dealing with
a lot of text data, and we hope to
turn these text data into something more
useful to us than the raw text data.
And here we distinguish
two different results.
One is high-quality information,
the other is actionable knowledge.
Sometimes the boundary between
the two is not so clear.
But I also want to say a little bit about
these two different angles of
the result of text mining.
In the case of high quality information,
we refer to more
concise information about the topic.
Which might be much easier for
humans to digest than the raw text data.
For example, you might face
a lot of reviews of a product.
A more concise form of information
would be a very concise summary
of the major opinions about
the features of the product.
Positive about,
let's say battery life of a laptop.
Now this kind of result is very useful
to help people digest the text data.
And so this is to minimize human effort
in consuming text data in some sense.
The other kind of output
is actionable knowledge.
Here we emphasize the utility
of the information or
knowledge we discover from text data.
It's actionable knowledge for some
decision problem, or some actions to take.
For example, we might be able to determine
which product is more appealing to us,
or a better choice for
a shopping decision.
Now, such an outcome could be
called actionable knowledge,
because a consumer can take the knowledge
and make a decision, and act on it.
So, in this case text mining supplies
knowledge for optimal decision making.
But again, the two are not so
clearly distinguished, so
we don't necessarily have
to make a distinction.
Text mining is also
related to text retrieval,
which is an essential component
in many text mining systems.
Now, text retrieval refers to
finding relevant information from
a large amount of text data.
So I've taught another separate MOOC
on text retrieval and search engines,
where we discussed various techniques for
text retrieval.
If you have taken that MOOC,
you will find some overlap.
And it will be useful to know
the background of text retrieval
for understanding some of
the topics in text mining.
But if you have not taken that MOOC,
it's also fine, because in this MOOC
on text mining and analytics, we're
going to repeat some of the key concepts
that are relevant for text mining,
but at a high level, and
we will also explain the relation between
text retrieval and text mining.
Text retrieval is very useful for
text mining in two ways.
First, text retrieval can be
a preprocessor for text mining.
Meaning that it can help
us turn big text data into
a relatively small amount
of most relevant text data.
Which is often what's needed for
solving a particular problem.
And in this sense, text retrieval
also helps minimize human effort.
Text retrieval is also needed for
knowledge provenance.
And this roughly corresponds
to the interpretation of text
mining as turning text data
into actionable knowledge.
Once we find the patterns in text data, or
actionable knowledge, we generally
would have to verify the knowledge
by looking at the original text data.
So the users would need some text
retrieval support to go back to the original
text data to interpret the pattern,
to better understand the knowledge, or
to verify whether a pattern
is really reliable.
So this is a high level introduction
to the concept of text mining,
and the relationship between
text mining and retrieval.
Next, let's talk about text
data as a special kind of data.
Now it's interesting to
view text data as data
generated by humans as subjective sensors.
So, this slide shows an analogy
between text data and non-text data.
And between humans as
subjective sensors and
physical sensors,
such as a network sensor or a thermometer.
So in general a sensor would
monitor the real world in some way.
It would sense some signal
from the real world, and
then would report the signal as data,
in various forms.
For example, a thermometer would watch
the temperature of the real world and
then report the temperature
in a particular format.
Similarly, a geo sensor would sense
the location and then report
the location specification, for
example, in the form of a longitude
value and a latitude value.
A network sensor would
monitor network traffic
or activities in the network and
report them in
some digital format of data.
Similarly we can think of
humans as subjective sensors
that will observe the real world
from some perspective.
And then humans will express what they
have observed in the form of text data.
So, in this sense, a human is actually
a subjective sensor that would also
sense what's happening in the world and
then express what's observed in the form
of data, in this case, text data.
Now, looking at the text data in
this way has an advantage of being
able to integrate all
types of data together.
And that's indeed needed in
most data mining problems.
So here we are looking at
the general problem of data mining.
And in general we would be
dealing with a lot of data
about our world that
are related to a problem.
And in general we will be dealing with
both non-text data and text data.
And of course the non-text data
are usually produced by physical sensors.
And those non-text data can
be also of different formats.
Numerical data, categorical,
or relational data,
or multi-media data like video or speech.
So, these non text data are often
very important in some problems.
But text data is also very important,
mostly because it contains
a lot of semantic content.
And it often contains
knowledge about the users,
especially preferences and
opinions of users.
So by treating text data as
the data observed from human sensors,
we can treat all this data
together in the same framework.
So the data mining problem is
basically to turn such data,
turn all the data, into actionable
knowledge so that we can take advantage
of it to change the real
world, of course for the better.
So this means the data mining problem is
basically taking a lot of data as input
and giving actionable knowledge as output.
Inside of the data mining module,
you can also see
we have a number of different
kinds of mining algorithms.
And this is because, for
different kinds of data,
we generally need different algorithms for
mining the data.
For example,
video data might require computer
vision to understand video content.
And that would facilitate
the more effective mining.
And we also have a lot of general
algorithms that are applicable
to all kinds of data and those algorithms,
of course, are very useful.
Although, for a particular kind of data,
we generally want to also
develop a special algorithm.
So this course will cover
specialized algorithms that
are particularly useful for
mining text data.
[MUSIC]

[SOUND]
This lecture is about syntagmatic
relation discovery and entropy.
In this lecture, we're going to continue
talking about word association mining.
In particular, we're going to talk about
how to discover syntagmatic relations.
And we're going to start with
the introduction of entropy,
which is the basis for designing some
measures for discovering such relations.
By definition,
syntagmatic relations hold between words
that have correlated co-occurrences.
That means,
when we see one word occur in context,
we tend to see the occurrence
of the other word.
So, take a more specific example, here.
We can ask the question,
whenever eats occurs,
what other words also tend to occur?
Looking at the sentences on the left,
we see some words that might occur
together with eats, like cat,
dog, or fish, right?
But if I take them out and
you look at the right side, where we
only show eats and some other words,
the question then is,
can you predict what other words
occur to the left or to the right?
So
this would force us to think about what
other words are associated with eats.
If they are associated with eats,
they tend to occur in the context of eats.
More specifically our
prediction problem is to take
any text segment which can be a sentence,
a paragraph, or a document.
And then ask the question,
is a particular word present or
absent in this segment?
Right here we ask about the word W.
Is W present or absent in this segment?
Now what's interesting is that
some words are actually easier
to predict than other words.
If you take a look at the three
words shown here, meat, the, and
unicorn, which one do you
think is easier to predict?
Now if you think about it for
a moment you might conclude that
the is easier to predict because
it tends to occur everywhere.
So I can just say,
well, the would be in the sentence.
Unicorn is also relatively easy
because unicorn is rare, is very rare.
And I can bet that it doesn't
occur in this sentence.
But meat is somewhere in
between in terms of frequency.
And it makes it harder to predict because
it's possible that it occurs in a sentence
or the segment, more accurately.
But it may also not occur in the sentence,
so
now let's study this
problem more formally.
So the problem can be formally defined
as predicting the value of
a binary random variable.
Here we denote it by X sub w,
w denotes a word, so
this random variable is associated
with precisely one word.
When the value of the variable is 1,
it means this word is present.
When it's 0, it means the word is absent.
And naturally, the probabilities for
1 and 0 should sum to 1,
because a word is either present or
absent in a segment.
There's no other choice.
So the intuition we discussed earlier
can be formally stated as follows.
The more random this random variable is,
the more difficult the prediction will be.
Now the question is how does one
quantitatively measure the randomness of
a random variable like X sub w?
How, in general, can we quantify
the randomness of a variable?
That's why we need a measure
called entropy.
This measure was introduced in information
theory to measure the randomness of X.
There is also some connection
with information here but
that is beyond the scope of this course.
So for
our purpose we just treat entropy function
as a function defined
on a random variable.
In this case, it is a binary random
variable, although the definition can
be easily generalized for
a random variable with multiple values.
Now the function form looks like this:
there's a sum over all the possible
values of this random variable.
Inside the sum for each value we
have a product of the probability
that the random variable equals this
value and log of this probability.
And note that there is also
a negative sign there.
Now entropy in general is non-negative.
And that can be mathematically proved.
So if we expand this sum, we'll see that
the equation looks like the second one.
Where I explicitly plugged
in the two values, 0 and 1.
And sometimes when we have 0 log of 0,
we would generally define that as 0,
because log of 0 is undefined.
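The slide formula is not reproduced in this transcript. As a sketch of what is being described, the standard definition would read as follows. The base-2 logarithm is an assumption, chosen because it agrees with the fair-coin entropy of 1 mentioned in this lecture.

```latex
H(X_w) = -\sum_{v \in \{0,1\}} p(X_w = v)\,\log_2 p(X_w = v)
       = -\,p(X_w = 0)\,\log_2 p(X_w = 0) \;-\; p(X_w = 1)\,\log_2 p(X_w = 1),
\qquad \text{where } 0 \log_2 0 \text{ is defined as } 0 .
```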
So this is the entropy function.
And this function will
give a different value for
different distributions
of this random variable.
And it clearly depends on the probability
that the random variable
takes a value of 1 or 0.
If we plot this function against
the probability that the random
variable is equal to 1,
then the function looks like this.
At the two ends,
that means when the probability of X
equals 1 is very small or very large,
then the entropy function has a low value.
When it's 0.5 in the middle
then it reaches the maximum.
Now if we plot the function
against the probability that X
takes a value of 0, the function
would show exactly the same curve here,
and you can imagine why.
That's because
the two probabilities are symmetric,
completely symmetric.
So an interesting question you
can think about in general is for
what kind of X does entropy
reach maximum or minimum.
And we can in particular think
about some special cases.
For example, in one case,
we might have a random variable that
always takes a value of 1.
The probability is 1.
Or there's a random variable that
is equally likely to take a value of one or
zero.
So in this case the probability
that X equals 1 is 0.5.
Now which one has a higher entropy?
It's easier to look at the problem
by thinking of a simple example
using coin tossing.
So when we think about random
experiments like tossing a coin,
it gives us a random variable,
that can represent the result.
It can be head or tail.
So we can define a random variable
X sub coin, so that it's 1
when the coin shows up as head,
it's 0 when the coin shows up as tail.
So now we can compute the entropy
of this random variable.
And this entropy indicates how
difficult it is to predict the outcome
of a coin toss.
So we can think about the two cases.
One is a fair coin, it's completely fair.
The coin shows up as head or
tail equally likely.
So the two probabilities would be a half.
Right?
So both are equal to one half.
Another extreme case is
completely biased coin,
where the coin always shows up as heads.
So it's a completely biased coin.
Now let's think about
the entropies in the two cases.
And if you plug in these values you can
see the entropies would be as follows.
For a fair coin we see the entropy
reaches its maximum, that's 1.
For the completely biased coin,
we see it's 0.
And that intuitively makes a lot of sense.
Because a fair coin is
most difficult to predict.
Whereas a completely biased
coin is very easy to predict.
We can always say, well, it's a head.
Because it is a head all the time.
So they can be shown on
the curve as follows.
So the fair coin corresponds to the middle
point where it's very uncertain.
The completely biased coin
corresponds to the end
point where we have a probability
of 1.0 and the entropy is 0.
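The two coin cases can be checked numerically. The following is a minimal sketch, assuming a base-2 logarithm and the convention that 0 log 0 is 0; the function name binary_entropy is made up for illustration.

```python
import math


def binary_entropy(p_one: float) -> float:
    """Entropy in bits of a binary random variable with P(X = 1) = p_one."""
    total = 0.0
    for p in (p_one, 1.0 - p_one):
        if p > 0.0:  # convention: a term with probability 0 contributes 0
            total -= p * math.log2(p)
    return total


# Fair coin: both outcomes have probability one half.
print(binary_entropy(0.5))  # 1.0
# Completely biased coin: heads has probability 1.
print(binary_entropy(1.0))  # 0.0
```

Calling binary_entropy with p and with 1 - p gives the same value, which is the symmetry of the curve described above.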
So, now let's see how we can use
entropy for word prediction.
Recall that our problem is
to predict whether W is present or
absent in this segment.
Again, think about the three words,
particularly think about their entropies.
Now we can assume high entropy
words are harder to predict.
And so we now have a quantitative way to
tell us which word is harder to predict.
Now if you look at the three words meat,
the, unicorn, again,
we clearly would expect meat to have
a higher entropy than the or unicorn.
In fact if you look at the entropy of the,
it's close to zero.
Because it occurs everywhere.
So it's like a completely biased coin.
Therefore the entropy is zero.
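To make the word prediction case concrete, here is a small sketch that estimates the presence probability of a word from a handful of segments and then computes its entropy. The four segments are invented toy data, not from the lecture, and presence_entropy is a hypothetical helper name.

```python
import math


def presence_entropy(word, segments):
    """Entropy (bits) of the presence/absence variable X_word over the segments."""
    p_one = sum(1 for seg in segments if word in seg) / len(segments)
    h = 0.0
    for p in (p_one, 1.0 - p_one):
        if p > 0.0:  # 0 * log(0) is treated as 0
            h -= p * math.log2(p)
    return h


# Toy segments (assumed data): each segment is the set of words it contains.
segments = [
    {"the", "cat", "eats", "meat"},
    {"the", "dog", "eats", "meat"},
    {"the", "dog", "runs"},
    {"the", "cat", "sleeps"},
]

for w in ("the", "meat", "unicorn"):
    print(w, presence_entropy(w, segments))
```

In this toy data "the" occurs in every segment and "unicorn" in none, so both are trivially predictable, while "meat" occurs in half of the segments and is the hardest to predict.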
[MUSIC]

[SOUND] This lecture is
about the syntagmatic
relation discovery and
conditional entropy.
In this lecture,
we're going to continue the discussion
of word association mining and analysis.
We're going to talk about the conditional
entropy, which is useful for
discovering syntagmatic relations.
Earlier, we talked about
using entropy to capture
how easy it is to predict the presence or
absence of a word.
Now, we'll address
a different scenario where
we assume that we know something
about the text segment.
So now the question is, suppose we know
that eats occurred in the segment.
How would that help us
predict the presence or
absence of another word, like meat?
And in particular, we want to
know whether the presence of eats
has helped us predict
the presence of meat.
And if we frame this using entropy,
that would mean we are interested
in knowing whether knowing
the presence of eats could reduce
uncertainty about meat.
Or, reduce the entropy
of the random variable
corresponding to the presence or
absence of meat.
We can also ask the question,
what if we know of the absence of eats?
Would that also help us predict
the presence or absence of meat?
These questions can be
addressed by using another
concept called conditional entropy.
So to explain this concept, let's first
look at the scenario we had before,
when we know nothing about the segment.
So we have these probabilities indicating
whether a word like meat occurs,
or it doesn't occur in the segment.
And we have an entropy function that
looks like what you see on the slide.
Now suppose we know eats is present, so
now we know the value of another
random variable that denotes eats.
Now, that would change all
these probabilities to
conditional probabilities.
Where we look at the presence or
absence of meat,
given that we know eats
occurred in the context.
So as a result,
if we replace these probabilities
with their corresponding conditional
probabilities in the entropy function,
we'll get the conditional entropy.
So this equation now here would be
the conditional entropy.
Conditional on the presence of eats.
So, you can see this is essentially
the same entropy function as you have
seen before, except that all
the probabilities now have a condition.
And this then tells us
the entropy of meat,
after we have known eats
occurring in the segment.
And of course, we can also define
this conditional entropy for
the scenario where we don't see eats.
So if we know it did not occur in
the segment, then this conditional
entropy would capture the uncertainty
of meat in that condition.
So now,
putting different scenarios together,
we have the complete definition
of conditional entropy as follows.
Basically, we're going to consider both
scenarios of the value of eats, zero and one,
and this gives us a probability
that eats is equal to zero or one.
Basically, whether eats is present or
absent.
And this of course,
is the conditional entropy of
meat in that particular scenario.
So if you expanded this entropy,
then you have the following equation.
Where you see the involvement of
those conditional probabilities.
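The equations on the slide are not part of this transcript. A sketch of the definition being described, assuming the notation X_meat and X_eats for the two presence variables and a base-2 logarithm, would be:

```latex
H(X_{meat} \mid X_{eats})
  = \sum_{u \in \{0,1\}} p(X_{eats} = u)\, H(X_{meat} \mid X_{eats} = u)
  = \sum_{u \in \{0,1\}} p(X_{eats} = u)
    \Big[ -\sum_{v \in \{0,1\}} p(X_{meat} = v \mid X_{eats} = u)\,
          \log_2 p(X_{meat} = v \mid X_{eats} = u) \Big]
```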
Now in general, for any discrete
random variables x and y,
the conditional entropy of x given y is no larger
than the entropy of the variable x.
So basically, this is an upper bound for
the conditional entropy.
That means by knowing more
information about the segment,
we won't be able to
increase uncertainty.
We can only reduce uncertainty.
And that intuitively makes sense
because as we know more information,
it should always help
us make the prediction.
And cannot hurt
the prediction in any case.
Now, what's interesting here is also to
think about what's the minimum possible
value of this conditional entropy?
Now, we know that the maximum
value is the entropy of X.
But what about the minimum,
so what do you think?
I hope you can reach the conclusion that
the minimum possible value, would be zero.
And it will be interesting to think about
under what situation we will achieve this.
So, let's see how we can use conditional
entropy to capture syntagmatic relation.
Now of course,
this conditional entropy gives us directly
one way to measure
the association of two words.
Because it tells us to what extent,
we can predict the one
word given that we know the presence or
absence of another word.
Now before we look at the intuition
of conditional entropy in capturing
syntagmatic relations, it's useful to
think of a very special case, listed here.
That is, the conditional entropy
of the word given itself.
So here,
we listed this conditional
entropy in the middle.
So, it's here.
So, what is the value of this?
Now, this means we know whether
meat occurs in the sentence.
And we hope to predict whether
meat occurs in the sentence.
And of course, this is 0 because
there's no uncertainty anymore.
Once we know whether the word
occurs in the segment,
we'll already know the answer
of the prediction.
So this is zero.
And that's also when this conditional
entropy reaches the minimum.
So now, let's look at some other cases.
So this is a case of knowing the and
trying to predict meat.
And this is a case of knowing eats and
trying to predict meat.
Which one do you think is smaller?
Note that smaller entropy means easier
prediction.
Which one do you think is higher,
and which one is smaller?
Well, if you look at the uncertainty,
then in the first case,
the doesn't really tell
us much about meat.
So knowing the occurrence of the doesn't
really help us reduce entropy that much.
So it stays fairly close to
the original entropy of meat.
Whereas in the case of eats,
eats is related to meat.
So knowing presence of eats or
absence of eats,
would help us predict whether meat occurs.
So it can help us reduce entropy of meat.
So we should expect the second term, namely
this one, to have a smaller entropy.
And that means there is a stronger
association between meat and eats.
So we now also know that when
this w is the same as
meat, the conditional entropy
reaches its minimum, which is 0.
And for what kind of words
would it reach its maximum?
Well, that's when this w
is not really related to meat.
Like the, for example:
it would be very close to the maximum,
which is the entropy of meat itself.
So this suggests that when you
use conditional entropy for
mining syntagmatic relations,
the algorithm would look as follows.
For each word W1, we're going to
enumerate all the other words W2.
And then, we can compute
the conditional entropy of W1 given W2.
We sort all the candidate words in
ascending order of the conditional entropy
because we want to favor
a word that has a small entropy.
Meaning that it helps us predict
the target word W1.
And then, we're going to take the top-ranked
candidate words as words that have
potential syntagmatic relations with W1.
Note that we need to use
a threshold to find these words.
The threshold can be the number
of top candidates to take, or
an absolute value for
the conditional entropy.
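The ranking procedure just described can be sketched in a few lines. This is only an illustration under stated assumptions: segments are modeled as sets of words, probabilities are plain relative frequencies without smoothing, the threshold is a top-k cutoff, and the toy data and function names are invented.

```python
import math


def entropy(probs):
    """Entropy in bits; zero-probability terms are skipped (0 log 0 = 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def conditional_entropy(target, given, segments):
    """H(X_target | X_given), estimated from relative frequencies."""
    n = len(segments)
    h = 0.0
    for given_present in (True, False):
        subset = [s for s in segments if (given in s) == given_present]
        if not subset:
            continue  # a scenario with probability 0 contributes nothing
        p_target = sum(1 for s in subset if target in s) / len(subset)
        h += (len(subset) / n) * entropy([p_target, 1.0 - p_target])
    return h


def rank_candidates(target, segments, top_k=3):
    """Sort all other words by ascending conditional entropy of the target."""
    vocab = set().union(*segments) - {target}
    scored = [(conditional_entropy(target, w, segments), w) for w in vocab]
    return sorted(scored)[:top_k]


# Toy segments (assumed data).
segments = [
    {"the", "cat", "eats", "meat"},
    {"the", "dog", "eats", "meat"},
    {"the", "dog", "runs"},
    {"the", "cat", "sleeps"},
]

print(rank_candidates("meat", segments))
```

On this toy data, knowing "eats" removes all uncertainty about "meat", so it ranks first, while knowing "the" leaves the entropy of "meat" unchanged.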
Now, this would allow us to mine the most
strongly correlated words with
a particular word, W1 here.
But, this algorithm does not
help us mine the strongest
K syntagmatic relations
from an entire collection.
Because in order to do that, we have to
ensure that these conditional entropies
are comparable across different words.
In this case of discovering
the syntagmatic relations for
a targeted word like W1, we only need
to compare the conditional entropies
for W1, given different words.
And in this case, they are comparable.
All right.
So, the conditional entropy of W1, given
W2, and the conditional entropy of W1,
given W3 are comparable.
They all measure how hard
it is to predict W1.
But, if we think about two pairs
that share W2 as the condition,
where we try to predict W1 and W3,
then the conditional entropies
are actually not comparable.
You can think about this question.
Why?
So why are they not comparable?
Well, that's because they
have different upper bounds.
Right?
Those upper bounds are precisely
the entropy of W1 and the entropy of W3.
Since these upper bounds differ,
we cannot really
compare them in this way.
So how do we address this problem?
Well, later we'll discuss how we can use
mutual information to solve this problem.
[MUSIC]

[SOUND].
This lecture is about the syntagmatic
relation discovery and mutual information.
In this lecture we are going to continue
discussing syntagmatic relation discovery.
In particular,
we are going to talk about another
concept in information theory,
called mutual information, and
how it can be used to discover
syntagmatic relations.
Before, we talked about a problem
with conditional entropy:
the conditional entropies
computed for different pairs of words
are not really comparable, so
that makes it harder to discover
strong syntagmatic relations
globally from a corpus.
So now we are going to introduce mutual
information, which is another concept
in information theory
that allows us to, in some sense,
normalize the conditional entropy to make
it more comparable across different pairs.
In particular, mutual information,
denoted by I(X;Y),
measures the entropy reduction
of X obtained from knowing Y.
More specifically, the question we
are interested in here is how much
reduction in the entropy of X can
we obtain by knowing Y.
So mathematically it can be
defined as the difference between
the original entropy of X and
the conditional entropy of X given Y.
And as you can see here,
it can also be defined
as the reduction of entropy of
Y because of knowing X.
Now normally the two conditional
entropies, H of X given Y and
H of Y given X, are not equal,
but interestingly,
the reduction of entropy by knowing
one of them is actually equal.
So, this quantity is called mutual
information, denoted by I here.
And this function has some interesting
properties, first it is also non-negative.
This is easy to understand because
the original entropy is always
not going to be lower than the possibly
reduced conditional entropy.
In other words, the conditional entropy
will never exceed the original entropy.
Knowing some information can
always help us potentially, but
will not hurt us in predicting x.
The second property is that it
is symmetric: while conditional
entropy is not symmetric,
mutual information is. And
the third property is that it
reaches its minimum, zero, if and
only if the two random variables
are completely independent.
That means knowing one of them does not
tell us anything about the other, and
this last property can be verified by
simply looking at the equation above:
it reaches 0 if and
only if the conditional entropy of X
given Y is exactly the same
as the original entropy of X.
So that means knowing Y did not
help at all, and that is when X and
Y are completely independent.
Now when we fix X, ranking different
Ys using conditional entropy
would give the same order as
ranking based on mutual information
(lower conditional entropy means
higher mutual information),
because in the function here,
H(X) is fixed because X is fixed.
So ranking based on mutual information is
exactly the same as ranking based on
the conditional entropy of X given Y, but
the mutual information allows us to
compare different pairs of x and y.
So, that is why mutual information is
more general and in general, more useful.
So, let us examine the intuition
of using mutual information for
syntagmatic relation mining.
Now, the question we ask for
syntagmatic relation mining is,
whenever "eats" occurs,
what other words also tend to occur?
So this question can be framed as
a mutual information question, that is,
which words have high mutual
information with eats?
So we compute the mutual information
between eats and other words.
And if we do that, and it is basically
based on the same intuition as conditional
entropy, we will see that words that
are strongly associated with eats
will have high mutual information.
Whereas words that are not related
will have lower mutual information.
For this, I will give some example here.
The mutual information between "eats" and
"meat",
which is the same as between "meat" and
"eats," because mutual information is
symmetric, is expected to be higher than
the mutual information between eats and
the, because knowing the does not
really help us predict eats.
Similarly,
knowing eats does not help us predict
the, either.
And you also can easily
see that the mutual
information between a word and
itself is the largest,
which is equal to
the entropy of this word,
because in this case the reduction is
maximum: knowing one allows
us to predict the other completely.
So the conditional entropy is zero,
therefore the mutual information
reaches its maximum.
It is going to be larger than or equal
to the mutual information between eats
and any other word.
In other words, if you pick any other word
and compute the mutual information
between eats and that word,
you will not get a value larger than
the mutual information between eats and itself.
So now let us look at how to
compute the mutual information.
Now in order to do that, we often
use a different form of mutual
information, and we can mathematically
rewrite the mutual information
into the form shown on this slide.
Where we essentially see
a formula that computes what is
called a KL-divergence.
This is another term
in information theory.
It measures the divergence
between two distributions.
Now, if you look at the formula,
it is also a sum over many combinations of
different values of the two random
variables, but inside the sum,
mainly we are doing a comparison
between two joint distributions.
The numerator has the
actual observed joint distribution
of the two random variables.
The bottom part, or the denominator, can be
interpreted as the expected joint
distribution of the two random variables
if they were independent, because when
two random variables are independent,
their joint distribution is equal to
the product of the two probabilities.
So this comparison will tell us whether
the two variables are indeed independent.
If they are indeed independent then we
would expect that the two are the same,
but if the numerator is different
from the denominator, that would mean
the two variables are not independent and
that helps measure the association.
The sum simply takes into
consideration all of the combinations
of the values of these
two random variables.
In our case, each random variable
can choose one of the two values,
zero or one, so
we have four combinations here.
If we look at this form of mutual
information, it shows that the mutual
information measures the divergence
of the actual joint distribution
from the expected distribution
under the independence assumption.
The larger this divergence is, the higher
the mutual information would be.
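The slide formula is not included in the transcript. A sketch of the form being described, assuming X_w1 and X_w2 denote the presence variables of the two words and a base-2 logarithm, is:

```latex
I(X_{w_1}; X_{w_2})
  = \sum_{u \in \{0,1\}} \sum_{v \in \{0,1\}}
    p(X_{w_1} = u, X_{w_2} = v)\,
    \log_2 \frac{p(X_{w_1} = u, X_{w_2} = v)}{p(X_{w_1} = u)\, p(X_{w_2} = v)}
```

When the two variables are independent, every ratio inside the logarithm is 1, so every term is 0 and the sum is 0.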
So now let us further look at what
are exactly the probabilities,
involved in this formula
of mutual information.
And here are all the probabilities
involved, and it is easy for
you to verify that.
Basically, we first have the
marginal probabilities
corresponding to the presence or
absence of each word.
So, for w1,
we have two probabilities shown here.
They should sum to one, because a word
can either be present or absent
in the segment. And similarly for
the second word, we also have two
probabilities representing presence or
absence of this word, and
they sum to one as well.
And finally, we have a lot of
joint probabilities that represent
the scenarios of co-occurrences of
the two words, and they are shown here.
And they sum to one because the two
words can only have these four
possible scenarios.
Either they both occur, so
in that case both variables will have
a value of one, or one of them occurs.
There are two scenarios.
In these two cases one of the random
variables will be equal to one and
the other will be zero. And finally we have
the scenario when none of them occurs.
This is when the two variables
both take a value of zero.
So these are the probabilities involved
in the calculation of mutual information,
over here.
Once we know how to calculate
these probabilities,
we can easily calculate
the mutual information.
It is also interesting to know that
there are actually some relations or
constraints among these probabilities,
and we already saw two of them, right?
So in the previous slide,
you have seen that
the marginal probabilities of these
words sum to one, and
we also have seen this constraint
that says the two words have these
four scenarios of co-occurrence.
But we also have some additional
constraints listed at the bottom.
For example, this one means if we add up
the probability that we observe
the two words occurring together and
the probability that the first word
occurs and the second word does not occur,
we get exactly the probability
that the first word is observed.
In other words,
when the first word is observed,
there are only two scenarios, depending on
whether the second word is also observed.
So, this probability captures the first
scenario when the second word
actually is also observed, and
this captures the second scenario
when the second word is not observed.
So, we only see the first word, and
it is easy to see the other equations
also follow the same reasoning.
Now these equations allow us to
compute some probabilities based on
other probabilities, and
this can simplify the computation.
So more specifically,
if we know the probability that
a word is present, like in this case,
so if we know this, and
if we know the probability of
the presence of the second word,
then we can easily compute
the absence probability, right?
It is very easy to use this
equation to do that, and so
we take care of the computation of
these probabilities of presence and
absence of each word.
Now let's look at
the joint distribution.
Let us assume that we also have available
the probability that
they occurred together.
Now it is easy to see that we can
actually compute all the rest of these
probabilities based on these.
Specifically for
example using this equation we can compute
the probability that the first word
occurred and the second word did not,
because we know these probabilities in
the boxes, and similarly using this
equation we can compute the probability
that we observe only the second word.
And then finally,
this probability can be calculated
by using this equation because
now this is known, and
this is also known, and
this is already known, right.
So now this can be calculated.
So this slide shows that we only
need to know how to compute
these three probabilities
that are shown in the boxes,
namely the presence of each word and the
co-occurrence of both words in a segment.
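The point that three probabilities are enough can be sketched in code. The function below derives the remaining joint probabilities from the constraints described in this lecture and then computes mutual information in the KL-divergence form. The function names and the base-2 logarithm are assumptions made for illustration.

```python
import math


def joint_from_three(p_w1, p_w2, p_both):
    """Derive all four joint probabilities, keyed by (x_w1, x_w2)."""
    return {
        (1, 1): p_both,
        (1, 0): p_w1 - p_both,               # only the first word occurs
        (0, 1): p_w2 - p_both,               # only the second word occurs
        (0, 0): 1.0 - p_w1 - p_w2 + p_both,  # neither word occurs
    }


def mutual_information(p_w1, p_w2, p_both):
    """Mutual information (bits) between the two presence variables."""
    joint = joint_from_three(p_w1, p_w2, p_both)
    marg1 = {1: p_w1, 0: 1.0 - p_w1}
    marg2 = {1: p_w2, 0: 1.0 - p_w2}
    mi = 0.0
    for (u, v), p_uv in joint.items():
        if p_uv > 0.0:  # zero-probability combinations contribute nothing
            mi += p_uv * math.log2(p_uv / (marg1[u] * marg2[v]))
    return mi


# Independent words: the joint equals the product of the marginals.
print(mutual_information(0.5, 0.5, 0.25))  # 0.0
# A word paired with itself: the result equals the entropy of that word.
print(mutual_information(0.5, 0.5, 0.5))   # 1.0
```

With real estimates, floating-point rounding can make a derived probability very slightly negative; this sketch simply skips any combination whose probability is not positive.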
[MUSIC]

[SOUND]
In general, we can use the empirical count
of events in the observed data
to estimate the probabilities.
And a commonly used technique is
called the maximum likelihood estimate,
where we simply normalize
the observed counts.
So if we do that,
we can compute these probabilities as follows.
For estimating the probability that
we see a word occurring in a segment,
we simply normalize the count of
segments that contain this word.
So let's first take
a look at the data here.
On the right side, you see a list of some
hypothesized data.
These are segments.
And in some segments you see both words
occur, they are indicated as ones for
both columns.
In some other cases only one will occur,
so only that column has one and
the other column has zero.
And of course, in some other
cases none of the words occur,
so they are both zeros.
And for estimating these probabilities, we
simply need to collect the three counts.
So the three counts are first,
the count of W1.
And that's the total number of
segments that contain word W1.
It's just the ones in the column of W1.
We can count how many
ones we have seen there.
The second count is for word 2, and we
just count the ones in the second column.
And these will give us the total
number of segments that contain W2.
The third count is when both words occur.
So this time, we're going to count
the segments where both columns have ones.
And then, so this would give us
the total number of segments
where we have seen both W1 and W2.
Once we have these counts,
we can just normalize these counts by N,
which is the total number of segments, and
this will give us the probabilities that
we need to compute mutual information.
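In symbols, the three maximum likelihood estimates described here would be as follows. The notation count(.) for the number of segments containing the given word or words is an assumption, since the slide is not part of the transcript.

```latex
p(X_{w_1} = 1) = \frac{\mathrm{count}(w_1)}{N}, \qquad
p(X_{w_2} = 1) = \frac{\mathrm{count}(w_2)}{N}, \qquad
p(X_{w_1} = 1, X_{w_2} = 1) = \frac{\mathrm{count}(w_1, w_2)}{N}
```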
Now, there is a small problem
when we have zero counts sometimes.
And in this case, we don't want a zero
probability because our data may be
a small sample and in general, we would
believe that it's potentially possible for
a word to occur in any context.
So, to address this problem,
we can use a technique called smoothing.
And that's basically to add some
small constant to these counts,
and so that we don't get
the zero probability in any case.
Now, the best way to understand smoothing
is to imagine that we actually observed more
data than we actually have, because we'll
pretend we observed some pseudo-segments.
I illustrated them on the top,
on the right side of the slide.
And these pseudo-segments would
contribute additional counts
of these words so
that no event will have zero probability.
Now, in particular we introduce
four pseudo-segments.
Each is weighted at one quarter.
And these represent the four different
combinations of occurrences of the two words.
So now each event,
each combination will have
at least one count or at least a non-zero
count from this pseudo-segment.
So, in the actual segments
that we'll observe,
it's okay if we haven't observed
all of the combinations.
So more specifically, you can see
the 0.5 here comes from the two
ones in the two pseudo-segments,
because each is weighted at one quarter.
We add them up, we get 0.5.
And similar to this,
0.25 comes from one single
pseudo-segment that indicates
the two words occur together.
And of course in the denominator we add
the total number of pseudo-segments that
we add; in this case,
we added four pseudo-segments.
Each is weighted at one quarter, so
the total sum is one.
So, that's why in the denominator
you'll see a one there.
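The smoothed estimates can be sketched as follows: 0.5 is added to each single-word count, 0.25 to the co-occurrence count, and 1 to the number of segments, matching the four pseudo-segments weighted at one quarter each. Segments are modeled as sets of words; the toy data and the function name are assumptions for illustration.

```python
def smoothed_probabilities(segments, w1, w2):
    """Return smoothed (p(w1 present), p(w2 present), p(both present))."""
    n = len(segments)
    count_w1 = sum(1 for s in segments if w1 in s)
    count_w2 = sum(1 for s in segments if w2 in s)
    count_both = sum(1 for s in segments if w1 in s and w2 in s)
    # Four pseudo-segments, each weighted 1/4: two of them contain w1,
    # two contain w2, and exactly one contains both words.
    p_w1 = (count_w1 + 0.5) / (n + 1)
    p_w2 = (count_w2 + 0.5) / (n + 1)
    p_both = (count_both + 0.25) / (n + 1)
    return p_w1, p_w2, p_both


# Toy segments (assumed data).
segments = [
    {"the", "cat", "eats", "meat"},
    {"the", "dog", "eats", "meat"},
    {"the", "dog", "runs"},
    {"the", "cat", "sleeps"},
]

# "unicorn" never occurs, yet its smoothed probabilities stay above zero.
print(smoothed_probabilities(segments, "eats", "unicorn"))
```

Because every one of the four presence and absence combinations receives a quarter of a pseudo-count, none of the joint probabilities derived from these three values is zero.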
So, this basically concludes
the discussion of how to compute
the mutual information for
syntagmatic relation discovery.
Now, so to summarize,
syntagmatic relations can generally
be discovered by measuring correlations
between occurrences of two words.
We've introduced three
concepts from information theory.
Entropy, which measures the uncertainty
of a random variable X.
Conditional entropy, which measures
the entropy of X given we know Y.
And mutual information of X and Y,
which measures the entropy reduction of X
due to knowing Y, or
entropy reduction of Y due to knowing X.
They are the same.
So these three concepts are actually very
useful for other applications as well.
That's why we spent some time
to explain this in detail.
But in particular,
they are also very useful for
discovering syntagmatic relations.
In particular,
mutual information is a principled way for
discovering such relations.
It allows us to have values
computed on different pairs of
words that are comparable, and
so we can rank these pairs and
discover the strongest syntagmatic
relations from a collection of documents.
Now, note that there is some relation
between syntagmatic relation discovery and
paradigmatic relation discovery.
So we already discussed the possibility
of using BM25 to achieve weighting for
terms in the context to potentially
also suggest the candidates
that have syntagmatic relations
with the candidate word.
But here, once we use mutual information
to discover syntagmatic relations,
we can also represent the context with
this mutual information as weights.
So this would give us
another way to represent
the context of a word, like cat.
And if we do the same for all the words,
then we can cluster these words or
compare the similarity between these
words based on their context similarity.
So this provides yet
another way to do term weighting for
paradigmatic relation discovery.
And so to summarize this whole part
about word association mining.
We introduced two basic associations,
called paradigmatic and
syntagmatic relations.
These are fairly general, they apply
to any items in any language, so
the units don't have to be words,
they can be phrases or entities.
We introduced multiple statistical
approaches for discovering them,
mainly showing that pure
statistical approaches
are viable for
discovering both kinds of relations.
And they can be combined to
perform joint analysis, as well.
These approaches can be applied
to any text with no human effort,
mostly because they are based
on counting of words, yet
they can actually discover
interesting relations of words.
We can also use different ways of
defining context and segment, and
this would lead us to some interesting
variations of applications.
For example, the context can be very
narrow, like a few words around a word, or
a sentence, or maybe paragraphs,
and using different contexts would
allow us to discover different flavors
of paradigmatic relations.
And similarly, when
counting co-occurrences using, let's say,
mutual information to discover
syntagmatic relations,
we also have to define the segment, and
the segment can be defined as a narrow
text window or a longer text article.
And this would give us different
kinds of associations.
These discovered associations can
support many other applications,
in both information retrieval and
text and data mining.
So here are some recommended readings,
if you want to know more about the topic.
The first is a book with
a chapter on collocations,
which is quite relevant to
the topic of these lectures.
The second is an article
about using various
statistical measures to
discover lexical atoms.
Those are phrases that
are non-compositional.
For example,
hot dog is not really a dog that's hot,
blue chip is not a chip that's blue.
And the paper has a discussion about some
techniques for discovering such phrases.
The third one is a new paper on a unified
way to discover both paradigmatic
relations and syntagmatic relations,
using random walks on word graphs.
[SOUND]

[SOUND]
So,
looking at the text mining problem more
closely, we see that the problem is
similar to general data mining, except
that we'll be focusing more on text data.
And we're going to have text mining
algorithms to help us to turn text data
into actionable knowledge that
we can use in the real world,
especially for decision making, or
for completing whatever tasks that
require text data to support.
Because, in general,
in many real world problems of data mining
we also tend to have other kinds
of data that are non-textual.
So a more general picture would be
to include non-text data as well.
And for this reason we might be
concerned with joint mining of text and
non-text data.
And so in this course we're
going to focus more on text mining,
but we're also going to touch on how to
do joint analysis of both text data and
non-text data.
With this problem definition we
can now look at the landscape of
the topics in text mining and analytics.
Now this slide shows the process of
generating text data in more detail.
More specifically, a human sensor or
human observer would look at
the world from some perspective.
Different people would be looking at
the world from different angles and
they'll pay attention to different things.
The same person at different times might
also pay attention to different aspects
of the observed world.
And so the humans are able to perceive
the world from some perspective.
And that human, the sensor,
would then form a view of the world.
And that can be called the Observed World.
Of course, this would be different from
the Real World because the perspective
that the person has taken
can often be biased.
Now the Observed World can be
represented as, for example,
entity-relation graphs or
in a more general way,
using knowledge representation language.
But in general, this is basically what
a person has in mind about the world.
And we don't really know what
exactly it looks like, of course.
But then the human would
express what the person has
observed using a natural language,
such as English.
And the result is text data.
Of course a person could have used
a different language to express what he or
she has observed.
In that case we might have text data of
mixed languages or different languages.
The main goal of text mining
is actually to reverse this
process of generating text data.
We hope to be able to uncover
some aspect in this process.
Specifically, we can think about mining,
for example, knowledge about the language.
And that means by looking at text data
in English, we may be able to discover
something about English, some usage
of English, some patterns of English.
So this is one type of mining problem,
where the result is
some knowledge about language which
may be useful in various ways.
If you look at the picture,
we can also then mine knowledge
about the observed world.
And so this has much to do with
mining the content of text data.
We're going to look at what the text
data are about, and then try to
get the essence of it or
extract high quality information
about a particular aspect of
the world that we're interested in.
For example, everything that has been
said about a particular person or
a particular entity.
And this can be regarded as mining content
to describe the observed world in
the user's mind or the person's mind.
If you look further,
then you can also imagine
we can mine knowledge about this observer,
himself or herself.
So this has also to do with
using text data to infer
some properties of this person.
And these properties could
include the mood of the person or
sentiment of the person.
And note that we distinguish
the observed world from the person
because text data can describe what the
person has observed in an objective way.
But the description can also be
subjective, with sentiment, and so,
in general, you can imagine the text
data would contain some factual
descriptions of the world plus
some subjective comments.
So that's why it's also possible to
do text mining to mine
knowledge about the observer.
Finally, if you look at
the left side of this picture,
then you can see we can certainly also
say something about the real world.
Right?
So indeed we can do text mining to
infer other real world variables.
And this is often called
predictive analytics.
And we want to predict the value
of certain interesting variables.
So, this picture basically covered
multiple types of knowledge that
we can mine from text in general.
When we infer other
real world variables we
could also use some of the results from
mining text data as intermediate
results to help the prediction.
For example,
after we mine the content of text data we
might generate some summary of content.
And that summary could be then used
to help us predict the variables
of the real world.
Now of course this is still generated
from the original text data,
but I want to emphasize here that
often the processing of text data
to generate some features that can help
with the prediction is very important.
And that's why here we show the results of
some other mining tasks, including
mining the content of text data and
mining knowledge about the observer,
can all be very helpful for prediction.
In fact, when we have non-text data,
we could also use the non-text
data to help prediction, and
of course it depends on the problem.
In general, non-text data can be very
important for such prediction tasks.
For example,
if you want to predict stock prices or
changes of stock prices based on
discussion in the news articles or
in social media, then this is an example
of using text data to predict
some other real world variables.
But in this case, obviously,
the historical stock price data would
be very important for this prediction.
And so that's an example of
non-text data that would be very
useful for the prediction.
And we're going to combine both kinds
of data to make the prediction.
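A very small Python sketch of what combining both kinds of data could look like is shown below. It simply concatenates word counts from an article with one non-text value. The function name build_features, the vocabulary, and the previous_close value are all hypothetical, and a real predictor would then be trained on such feature vectors.

```python
def build_features(article_text, previous_close, vocabulary):
    """Concatenate text features (word counts) with a non-text feature."""
    tokens = article_text.lower().split()
    text_features = [tokens.count(w) for w in vocabulary]
    return text_features + [previous_close]

# Hypothetical vocabulary and historical price, for illustration only.
features = build_features("Profits surge as sales surge", 101.5, ["surge", "loss"])
```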
Now non-text data can be also used for
analyzing text by supplying context.
When we look at the text data alone,
we'll be mostly looking at the content
and/or opinions expressed in the text.
But text data generally also
has context associated.
For example, the time and the location
that are associated with the text data.
And this is useful context information.
And the context can provide interesting
angles for analyzing text data.
For example, we might partition text
data into different time periods
because of the availability of the time.
Now we can analyze text data in each
time period and then make a comparison.
Similarly we can partition text
data based on locations or
any meta data that's associated to
form interesting comparisons in areas.
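As an illustration, the following Python sketch partitions a few documents by a metadata field and counts words in each partition so that the partitions can be compared. The documents, the field names year and location, and the function word_counts_by_context are made up for this example.

```python
from collections import Counter, defaultdict

def word_counts_by_context(documents, key):
    """Group documents by a metadata field and count words in each group."""
    groups = defaultdict(Counter)
    for doc in documents:
        groups[doc[key]].update(doc["text"].lower().split())
    return groups

# Hypothetical documents with time and location as context.
docs = [
    {"year": 2012, "location": "US", "text": "battery life is great"},
    {"year": 2012, "location": "UK", "text": "screen is great"},
    {"year": 2013, "location": "US", "text": "battery drains fast"},
]
by_year = word_counts_by_context(docs, "year")
by_location = word_counts_by_context(docs, "location")
```

The same text data gives different views depending on which piece of context is used for the partition.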
So, in this sense,
non-text data can actually provide
interesting angles or
perspectives for text data analysis.
And it can help us make context-sensitive
analysis of content or
the language usage or
the opinions about the observer or
the authors of text data.
We could analyze the sentiment
in different contexts.
So this is a fairly general landscape of
the topics in text mining and analytics.
In this course we're going to
selectively cover some of those topics.
We actually hope to cover
most of these general topics.
First we're going to cover
natural language processing very
briefly because this has to do
with understanding text data and
this determines how we can represent
text data for text mining.
Second, we're going to talk about how to
mine word associations from text data.
And word associations are a form of useful
lexical knowledge about a language.
Third, we're going to talk about
topic mining and analysis.
And this is only one way to
analyze content of text, but
it's a very useful way
of analyzing content.
It's also one of the most useful
techniques in text mining.
Then we're going to talk about
opinion mining and sentiment analysis.
So this can be regarded as one example
of mining knowledge about the observer.
And finally we're going to
cover text-based prediction
problems where we try to predict some
real world variable based on text data.
So this slide also serves as
a road map for this course.
And we're going to use
this as an outline for
the topics that we'll cover
in the rest of this course.
[MUSIC]

[SOUND]
This lecture is about natural language
content analysis.
Natural language content analysis
is the foundation of text mining.
So we're going to first talk about this.
And in particular,
natural language processing is
a factor in how we can represent text data.
And this determines what algorithms can
be used to analyze and mine text data.
We're going to take a look at the basic
concepts in natural language first.
And I'm going to explain these concepts
using a simple example
that you're all seeing here.
A dog is chasing a boy on the playground.
Now this is a very simple sentence.
When we read such a sentence
we don't have to think
about it to get the meaning of it.
But when a computer has to
understand the sentence,
the computer has to go
through several steps.
First, the computer needs
to know what are the words,
how to segment the words in English.
And this is very easy,
we can just look at the space.
And then the computer will need
to know the categories of these words,
syntactic categories.
So for example, dog is a noun,
chasing is a verb, boy is another noun, etc.
And this is called lexical analysis.
In particular, tagging these words
with these syntactic categories
is called part-of-speech tagging.
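A toy Python sketch of these two steps for the example sentence is given below. The lexicon is hand-written for this one sentence and is purely hypothetical; a real part-of-speech tagger would be learned from annotated data and would have to resolve ambiguous words.

```python
# Hypothetical hand-written lexicon covering only the example sentence.
TOY_LEXICON = {
    "a": "Det", "the": "Det",
    "dog": "N", "boy": "N", "playground": "N",
    "is": "Aux", "chasing": "V",
    "on": "P",
}

def tokenize(sentence):
    """Segment English words by looking at the spaces."""
    return sentence.lower().rstrip(".").split()

def tag(tokens):
    """Attach a syntactic category to each word (UNK if not in the lexicon)."""
    return [(t, TOY_LEXICON.get(t, "UNK")) for t in tokens]

tagged = tag(tokenize("A dog is chasing a boy on the playground."))
```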
After that the computer also needs to
figure out the relationship between
these words.
So a and dog would form a noun phrase.
On the playground would be
a prepositional phrase, etc.
And there are certain ways for
them to be connected together in order for
them to create meaning.
Some other combinations
may not make sense.
And this is called syntactical parsing, or
syntactical analysis,
parsing of a natural language sentence.
The outcome is a parse tree
that you are seeing here.
That tells us the structure
of the sentence, so
that we know how we can
interpret this sentence.
But this is not semantics yet.
So in order to get the meaning we
would have to map these phrases and
these structures into some real world
entities that we have in our mind.
So dog is a concept that we know,
and boy is a concept that we know.
So connecting these phrases to
entities that we know is understanding.
Now a computer would have to formally
represent these entities by using symbols.
So Dog(d1) means d1 is a dog.
Boy(b1) means b1 refers to a boy, etc.
And it also represents the chasing
action as a predicate.
So, chasing is a predicate here with
three arguments, d1, b1, and p1,
which is the playground.
So this is a formal representation of
the semantics of this sentence.
Once we reach that level of understanding,
we might also make inferences.
For example, if we assume there's a rule
that says if someone's being chased then
the person can get scared, then we
can infer this boy might be scared.
This is the inferred meaning,
based on additional knowledge.
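The formal representation and the inference step can be sketched in Python as follows. Facts are stored as plain tuples, and the rule is written by hand; the tuple encoding and the function name apply_scared_rule are assumptions for illustration, not a general logic engine.

```python
# Facts from the sentence "A dog is chasing a boy on the playground."
facts = {
    ("Dog", "d1"),
    ("Boy", "b1"),
    ("Playground", "p1"),
    ("Chasing", "d1", "b1", "p1"),
}

def apply_scared_rule(facts):
    """Rule: if a person (here, a Boy) is being chased, infer Scared(person)."""
    inferred = set()
    for fact in facts:
        if fact[0] == "Chasing":
            _, chaser, chased, place = fact
            if ("Boy", chased) in facts:
                inferred.add(("Scared", chased))
    return inferred

new_facts = apply_scared_rule(facts)
```

The inferred fact is not stated anywhere in the sentence; it comes only from combining the extracted facts with the additional rule.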
And finally, we might even further infer
what this sentence is requesting,
or why the person who said
the sentence is saying it.
And so, this has to do with
the purpose of saying the sentence.
This is called speech act analysis or
pragmatic analysis,
which refers to the use of language.
So, in this case a person saying this
may be reminding another person to
bring back the dog.
So this means when saying a sentence,
the person actually takes an action.
So the action here is to make a request.
Now, this slide clearly shows that
in order to really understand
a sentence there are a lot of
things that a computer has to do.
Now, in general it's very hard for
a computer to do everything,
especially if you would want
it to do everything correctly.
This is very difficult.
Now, the main reason why natural
language processing is very difficult
is that natural language is designed to
make human communication efficient.
As a result, for example,
we omit a lot of common sense knowledge.
Because we assume all of
us have this knowledge,
there's no need to encode this knowledge.
That makes communication efficient.
We also keep a lot of ambiguities,
like, ambiguities of words.
And this is again, because we assume we
have the ability to disambiguate the word.
So, there's no problem with
having the same word to mean
possibly different things
in different context.
Yet for
a computer this would be very difficult
because a computer does not have
the common sense knowledge that we do.
So the computer will be confused indeed.
And this makes it hard for
natural language processing.
Indeed, it makes it very hard for
every step in the slide
that I showed you earlier.
Ambiguity is a main killer.
Meaning that in every step
there are multiple choices,
and the computer would have to
decide what's the right choice and
that decision can be very difficult
as you will see also in a moment.
And in general,
we need common sense reasoning in order
to fully understand the natural language.
And computers today don't yet have that.
That's why it's very hard for
computers to precisely understand
the natural language at this point.
So here are some specific
examples of challenges.
Think about the word-level ambiguity.
A word like design can be a noun or
a verb, so
we've got an ambiguous part-of-speech tag.
Root also has multiple meanings,
it can be of mathematical sense,
like in the square root of, or
it can be the root of a plant.
Syntactic ambiguity refers
to different interpretations
of a sentence in terms of structures.
So for example,
natural language processing can
actually be interpreted in two ways.
So one is the ordinary meaning that we
will be getting as we're
talking about this topic.
So, it's processing of natural language.
But there is also another
possible interpretation
which is to say language
processing is natural.
Now we don't generally have this problem,
but imagine for the computer to determine
the structure, the computer would have
to make a choice between the two.
Another classic example is a man
saw a boy with a telescope.
And this ambiguity lies in
the question who had the telescope?
This is called a prepositional
phrase attachment ambiguity.
Meaning where to attach this
prepositional phrase with the telescope.
Should it modify the boy?
Or should it be modifying, saw, the verb.
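The two attachments can be written down as two different trees over the same words. The Python sketch below uses nested tuples with a label in the first position; the labels (S, NP, VP, PP, V) and the exact bracketing are assumptions for illustration.

```python
# Reading 1: the prepositional phrase attaches to the verb (the man used the telescope).
verb_attach = ("S", ("NP", "a", "man"),
               ("VP", ("V", "saw"),
                      ("NP", "a", "boy"),
                      ("PP", "with", ("NP", "a", "telescope"))))

# Reading 2: the prepositional phrase attaches to the noun (the boy had the telescope).
noun_attach = ("S", ("NP", "a", "man"),
               ("VP", ("V", "saw"),
                      ("NP", ("NP", "a", "boy"),
                             ("PP", "with", ("NP", "a", "telescope")))))

def leaves(tree):
    """Read the words off a tree, left to right, skipping the labels."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words
```

Both trees yield exactly the same word sequence, which is why the words alone cannot tell the computer which structure is intended.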
Another problem is anaphora resolution.
In John persuaded Bill to buy a TV for
himself.
Does himself refer to John or Bill?
Presupposition is another difficulty.
He has quit smoking implies
that he smoked before, and
we need to have such knowledge in
order to understand the language.
Because of these problems, the state
of the art natural language processing
techniques can not do anything perfectly.
Even for
the simplest part of speech tagging,
we still can not solve the whole problem.
The accuracy that is listed here,
which is about 97%,
was just taken from some studies earlier.
And these studies obviously have to
be using particular data sets, so
the numbers here are not
really meaningful if you
take them out of the context of the data
set that is used for evaluation.
But I show these numbers mainly to give
you some sense about the accuracy,
or how well we can do things like this.
It doesn't mean any data set
accuracy would be precisely 97%.
But, in general, we can do part-of-speech
tagging fairly well, although not perfectly.
Parsing would be more difficult, but for
partial parsing, meaning to get some
phrases correct, we can probably
achieve 90% or better accuracy.
But to get the complete parse tree
correctly is still very, very difficult.
For semantic analysis, we can also do
some aspects of semantic analysis,
particularly, extraction of entities and
relations.
For example, recognizing this is
the person, that's a location, and
this person and
that person met in some place etc.
We can also do word sense
disambiguation to some extent.
The occurrence of root in this sentence
refers to the mathematical sense etc.
Sentiment analysis is another aspect
of semantic analysis that we can do.
That means we can tag sentences
as generally positive when
they're talking about a product or
talking about a person.
Inference, however, is very hard,
and we generally cannot do that for
any big domain; it's only
feasible for a very limited domain.
And that's a generally difficult
problem in artificial intelligence.
Speech act analysis is
also very difficult and
we can only do this probably for
very specialized cases.
And with a lot of help from humans
to annotate enough data for
the computers to learn from.
So the slide also shows that
computers are far from being able to
understand natural language precisely.
And that also explains why the text
mining problem is difficult.
Because we cannot rely on
mechanical approaches or
computational methods to
understand the language precisely.
Therefore, we have to use
whatever we have today.
In particular, statistical machine learning
methods or statistical analysis methods,
to try to get as much meaning
out from the text as possible.
And, later you will see
that there are actually
many such algorithms
that can indeed extract
interesting knowledge from text even though
we cannot really fully understand
the meaning of all the natural
language sentences precisely.
[MUSIC]

[SOUND]
So here are some specific examples of what
we can't do today.
Part-of-speech tagging is still
not easy to do 100% correctly.
So in the example, he turned off the
highway versus he turned off the fan,
the two offs actually have somewhat
different syntactic
categories. And also it's very difficult
to get a complete parse correct.
Again, the example, a man saw a boy
with a telescope can actually
be very difficult to parse
depending on the context.
Precise deep semantic
analysis is also very hard.
For example, to define the meaning of own
precisely is very difficult in
a sentence like John owns a restaurant.
So the state of the art can
be summarized as follows.
Robust and
general NLP tends to be shallow while
a deep understanding does not scale up.
For this reason in this course,
the techniques that we cover are in
general, shallow techniques for
analyzing text data and
mining text data and they are generally
based on statistical analysis.
So they are robust and
general, and they are in
the category of shallow analysis.
So such techniques have
the advantage of being able to be
applied to any text data in
any natural language about any topic.
But the downside is that they don't
give us a deeper understanding of text.
For that, we have to rely on
deeper natural language analysis.
That typically would require
human effort to annotate
a lot of examples of the analysis that we
would like to do, and then computers can use
machine learning techniques and learn from
these training examples to do the task.
So in practical applications, we generally
combine the two kinds of techniques,
with the general statistical
methods as a backbone, as the basis.
These can be applied to any text data.
And on top of that, we're going to use
humans to annotate more data and
use supervised machine learning
to do some tasks as well as we can,
especially for those important
tasks, to bring humans into the loop
to analyze text data more precisely.
But this course will cover
the general statistical approaches
that generally,
don't require much human effort.
So they're practically
more useful than some of the deeper
analysis techniques that require a lot of
human effort to annotate the text data.
So to summarize,
the main points to take away are: first, NLP
is the foundation for text mining.
So obviously, the better we
can understand the text data,
the better we can do text mining.
Computers today are far from being able
to understand the natural language.
Deep NLP requires common sense
knowledge and inferences.
Thus, it only works for
very limited domains and is not feasible for
large scale text mining.
Shallow NLP based on statistical
methods can be done in large scale and
is the main topic of this course.
These methods are generally applicable
to a lot of applications.
They are, in some sense, also
more useful techniques.
In practice,
we use statistical NLP as the basis and
we'll have humans for
help as needed in various ways.
[MUSIC]

[SOUND] This lecture is
about Text Representation.
In this lecture we're going to discuss
text representation and discuss how
natural language processing can allow us
to represent text in many different ways.
Let's take a look at this
example sentence again.
We can represent this sentence
in many different ways.
First, we can always represent such
a sentence as a string of characters.
This is true for all the languages
when we store them in the computer.
When we store a natural language
sentence as a string of characters,
we have perhaps the most general
way of representing text since
we can always use this approach
to represent any text data.
But unfortunately using such
a representation will not help us to do
semantic analysis, which is often needed
for many applications of text mining.
The reason is because we're
not even recognizing words.
So as a string we are going to keep all
of the spaces and these ASCII symbols.
We can perhaps count what's
the most frequent character in
the English text or
the correlation between those characters.
But we can't really analyze semantics. Yet
this is the most general way of
representing text because we
can use this to represent
any natural language text.
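As a small illustration of what can be done at the string level, this Python sketch counts characters in the example sentence without recognizing any words. The variable names are arbitrary.

```python
from collections import Counter

text = "A dog is chasing a boy on the playground."

# Every character is counted, including spaces and punctuation,
# because at this level we have not identified any words.
char_counts = Counter(text)
most_common_char, most_common_count = char_counts.most_common(1)[0]
```

Here the most frequent character is simply the space, which tells us nothing about what the sentence means.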
If we try to do a little bit more
natural language processing by
doing word segmentation,
then we can obtain a representation
of the same text, but
in the form of a sequence of words.
So here we see that we can identify words,
like a dog is chasing, etc.
Now with this level of representation
we suddenly can do a lot of things.
And this is mainly because words are the
basic units of human communication and
natural language.
So they are very powerful.
By identifying words, we can for
example, easily count what
are the most frequent words in this
document or in the whole collection, etc.
And these words can be
used to form topics
when we combine related words together.
And some words are positive and
some words are negative, so
we can also do sentiment analysis.
So representing text data as a sequence
of words opens up a lot of interesting
analysis possibilities.
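A minimal Python sketch of word-level analysis follows: it counts the most frequent words in a small collection and scores documents with a tiny sentiment lexicon. The collection, the POSITIVE and NEGATIVE word lists, and the scoring rule are all hypothetical and only meant to show what becomes possible once words are identified.

```python
from collections import Counter

# Hypothetical sentiment lexicon, for illustration only.
POSITIVE = {"great", "good"}
NEGATIVE = {"bad", "poor"}

def words(text):
    """Segment a text into words by splitting on whitespace."""
    return text.lower().split()

def sentiment_score(text):
    """Number of positive words minus number of negative words."""
    tokens = words(text)
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

collection = [
    "the camera is great",
    "the battery is bad",
    "great screen and good price",
]
word_freq = Counter(w for doc in collection for w in words(doc))
```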
However, this level of representation
is slightly less general than a string of
characters,
because in some languages, such as
Chinese, it's actually not that easy to
identify all the word boundaries,
because in such a language you see
text as a sequence of characters
with no space in between.
So you have to rely on some special
techniques to identify words.
In such a language of course then we
might make mistakes in segmenting words.
So the sequence of words representation
is not as robust as string of characters.
But in English, it's very easy to
obtain this level of representation.
So we can do that all the time.
Now if we go further to do natural language
processing, we can add part-of-speech
tags.
Now once we do that we can count, for
example, the most frequent nouns or
what kind of nouns are associated
with what kind of verbs, etc.
So, this opens up a little bit
more interesting opportunities for
further analysis.
Note that I use a plus sign here because
by representing text as a sequence
of part of speech tags,
we don't necessarily replace
the original word sequence.
Instead, we add this as an additional
way of representing text data.
So now the data is represented
as both a sequence of words and
a sequence of part of speech tags.
This enriches the representation
of text data, and,
thus also enables a more
interesting analysis.
If we go further,
then we'll be parsing the sentence
to obtain a syntactic structure.
Now this of course will
further open up more
interesting analysis of, for example,
the writing styles or
correcting grammar mistakes.
If we go further for semantic analysis,
then we might be able to
recognize dog as an animal.
And we also can recognize boy as a person,
and playground as a location.
And we can further
analyse their relations.
For example, dog was chasing the boy,
and boy is on the playground.
This will add more entities and relations,
through entity-relation recognition.
At this level,
we can do even more interesting things.
For example, now we can easily
count the most frequent person
that's mentioned in this whole
collection of news articles.
Or whenever you mention this person
you also tend to see mentioning
of another person, etc.
So this is a very useful representation.
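Assuming entity mentions have already been recognized and typed, the counting described here could be sketched in Python as follows. The articles, names, and type labels are hypothetical stand-ins for the output of an entity recognizer.

```python
from collections import Counter
from itertools import combinations

# Hypothetical entity annotations, one list of (name, type) pairs per article.
articles = [
    [("Alice", "PERSON"), ("Bob", "PERSON"), ("Chicago", "LOCATION")],
    [("Alice", "PERSON"), ("Bob", "PERSON")],
    [("Alice", "PERSON"), ("Paris", "LOCATION")],
]

# How often is each person mentioned in the collection?
person_counts = Counter(
    name for article in articles for name, kind in article if kind == "PERSON"
)

# Which pairs of people tend to be mentioned in the same article?
pair_counts = Counter()
for article in articles:
    people = sorted({name for name, kind in article if kind == "PERSON"})
    pair_counts.update(combinations(people, 2))
```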
And it's also related to the knowledge
graph that some of you may have heard of
that Google is doing as a more semantic
way of representing text data.
However, it's also less
robust than a sequence of words
or even syntactic analysis,
because it's not always easy
to identify all the entities with the
right types and we might make mistakes.
And relations are even harder to find and
we might make mistakes.
This makes this level of representation
less robust, yet it's very useful.
Now if we move further to logic
representation, then we have predicates and
inference rules.
With inference rules we can infer
interesting derived facts from the text.
So that's very useful but
unfortunately, this level of
representation is even less robust and
we can make mistakes.
And we can't do that all the time for
all kinds of sentences.
And finally speech acts would add
yet another level of representation of
the intent of saying this sentence.
So in this case it might be a request.
So knowing that would allow us to
analyze even more interesting
things about the observer or
the author of this sentence.
What's the intention of saying that?
What scenarios or
what kind of actions will be made?
So this is another level of analysis
that would be very interesting.
So this picture shows that if
we move down, we generally see
more sophisticated natural language
processing techniques will be used.
And unfortunately such techniques
would require more human effort.
And they are less accurate.
That means there are mistakes.
So if we analyze our text at
the levels that are representing
deeper analysis of language then
we have to tolerate errors.
So that also means it's still necessary
to combine such deep analysis
with shallow analysis based on,
for example, sequence of words.
On the right side, you see the arrow
points down to indicate that
as we go down, our representation of
text is closer to the knowledge representation
in our mind that we need for
solving a lot of problems.
Now, this is desirable because as we can
represent text at the level of knowledge,
we can easily extract the knowledge.
That's the purpose of text mining.
So, there is a trade-off here
between doing deeper analysis
that might have errors but
would give us direct knowledge
that can be extracted from text,
and doing shallow analysis
which is more robust but
wouldn't actually give us the necessary
deeper representation of knowledge.
I should also say that text
data are generated by humans,
and are meant to be consumed by humans.
So as a result, in text data analysis,
text mining,
humans play a very important role.
They are always in the loop,
meaning that we should optimize
a collaboration of humans and computers.
So, in that sense it's okay that
computers may not be able to
have completely accurate
representation of text data.
And patterns that are extracted from
text data can be interpreted by humans.
And then humans can guide the computers to
do more accurate analysis by annotating
more data, by providing features to
guide machine learning programs,
to make them work more effectively.
[MUSIC]

[SOUND]
So, as we explained, different text
representations tend to
enable different analyses.
In particular,
we can gradually add more and
more deeper analysis results
to represent text data.
And that would open up more
interesting representation
opportunities and
also analysis capacities.
So, this table summarizes
what we have just seen.
So the first column shows
the text representation.
The second visualizes the generality
of such a representation.
Meaning whether we can do this
kind of representation accurately for
all the text data or only some of them.
And the third column shows
the enabled analysis techniques.
And the final column shows some
examples of application that
can be achieved through this
level of representation.
So let's take a look at them.
So as a string, text can only be processed
by string processing algorithms.
It's very robust, it's general.
And there are still some interesting
applications that can be done
at this level.
For example, compression of text
doesn't necessarily need to
know the word boundaries,
although knowing word boundaries
might actually also help.
Word-based representation is a very
important level of representation.
It's quite general and
relatively robust, and it enables
a lot of analysis techniques,
such as word relation analysis,
topic analysis and sentiment analysis.
And there are many applications that can
be enabled by this kind of analysis.
For example, thesaurus discovery has
to do with discovering related words.
And topic and
opinion related applications abound.
For example, people
might be interested in knowing the major
topics covered in the collection of texts.
And this can be the case
in research literature.
And scientists want to know what are the
most important research topics today.
Or customer service people might want to
know what are the major complaints from their
customers by mining their e-mail messages.
And business intelligence
people might be interested in
understanding consumers' opinions about
their products and the competitors'
products to figure out what are the
winning features of their products.
And, in general, there are many
applications that can be enabled by
the representation at this level.
Now, moving down, we'll see we can
gradually add additional representations.
By adding syntactical structures,
we can enable, of course,
syntactical graph analysis.
We can use graph mining algorithms
to analyze syntactic graphs.
And some applications are related
to this kind of representation.
For example,
stylistic analysis generally requires
syntactical structure representation.
We can also generate
structure-based features.
And those are features that might help us
classify the text objects into different
categories. By looking at the structures,
sometimes the classification
can be more accurate.
For example,
if you want to classify articles into
different categories corresponding
to different authors.
You want to figure out which of
the k authors has actually written
this article, then you generally need
to look at the syntactic structures.
When we add entities and relations,
then we can enable other techniques
such as knowledge graph
analysis, or information network
analysis in general.
And this analysis enables
applications about entities.
For example,
discovery of all the knowledge and
opinions about real world entities.
You can also use this level of representation
to integrate everything about
anything from scattered sources.
Finally, when we add logical predicates,
that would enable logic inference,
of course.
And this can be very useful for
integrating analysis of
scattered knowledge.
For example,
we can also add an ontology on top of the
information extracted from text
to make inferences.
A good example of an application
enabled by this level of representation
is a knowledge assistant for biologists.
And this is a program that can help a biologist
manage all the relevant knowledge from
literature about a research problem such
as understanding functions of genes.
And the computer can make inferences
about some of the hypotheses that
the biologist might be interested in.
For example,
whether a gene has a certain function, and
then the intelligent program can read the
literature to extract the relevant facts,
by doing compilation and
information extraction.
And then it can use a logic system to
actually track down the answers
to researchers' questions about what
genes are related to what functions.
So in order to support
this level of application
we need to go as far as
logical representation.
Now, this course is covering techniques
mainly based on word based representation.
And these techniques are general and
robust, and thus are more widely
used in various applications.
In fact, in virtually all the text mining
applications you need this level of
representation and the techniques that
support analysis of text at this level.
But obviously all these other
levels can be combined and
should be combined in order to support
the sophisticated applications.
So to summarize,
here are the major takeaway points.
Text representation determines what
kind of mining algorithms can be applied.
And there are multiple ways to
represent the text, strings, words,
syntactic structures, entity-relation
graphs, knowledge predicates, etc.
And these different
representations should in general
be combined in real applications
to the extent we can.
For example, even if we cannot
do accurate representations
of syntactic structures, we can still
extract partial structures.
And if we can recognize some entities,
that would be great.
So in general we want to
do as much as we can.
And when different levels
are combined together,
we can enable a richer analysis,
more powerful analysis.
This course however focuses
on word-based representation.
Such techniques also have several
advantages. First, they are general and
robust, so they are applicable
to any natural language.
That's a big advantage over
other approaches that rely on
more fragile natural language
processing techniques.
Secondly, it does not require
much manual effort, or
sometimes, it does not
require any manual effort.
So that's, again, an important benefit,
because that means that you can apply
it directly to any application.
Third, these techniques are actually
surprisingly powerful and
effective for many applications,
although not all, of course,
as I just explained.
Now they are very effective
partly because the words
are invented by humans as basic
units for communication.
So they are actually quite sufficient for
representing all kinds of semantics.
So that makes this kind of word-based
representation also powerful.
And finally, such a word-based
representation and the techniques enabled
by such a representation can be combined
with many other sophisticated approaches.
So they're not competing with each other.
[MUSIC]

[SOUND] This lecture is
about the word association
mining and analysis.
In this lecture,
we're going to talk about how to mine
associations of words from text.
Now this is an example of knowledge
about the natural language that
we can mine from text data.
Here's the outline.
We're going to first talk about
what is word association and
then explain why discovering such
relations is useful and finally
we're going to talk about some general
ideas about how to mine word associations.
In general there are two word
relations and these are quite basic.
One is called a paradigmatic relation.
The other is syntagmatic relation.
A and B have paradigmatic relation
if they can be substituted for each other.
That means the two words that
have paradigmatic relation
would be in the same semantic class,
or syntactic class.
And we can in general
replace one by the other
without affecting
the understanding of the sentence.
That means we would still
have a valid sentence.
For example, cat and dog, these two
words have a paradigmatic relation
because they are in
the same class of animal.
And in general,
if you replace cat with dog in a sentence,
the sentence would still be a valid
sentence that you can make sense of.
Similarly, Monday and
Tuesday have a paradigmatic relation.
The second kind of relation is
called a syntagmatic relation.
In this case, the two words that have this
relation, can be combined with each other.
So A and B have syntagmatic relation if
they can be combined with each other in
a sentence, that means these two
words are semantically related.
So for example, cat and sit are related
because a cat can sit somewhere.
Similarly, car and
drive are related semantically and
they can be combined with
each other to convey meaning.
However, in general, we can not
replace cat with sit in a sentence or
car with drive in the sentence
to still get a valid sentence,
meaning that if we do that, the sentence
will become somewhat meaningless.
So this is different from
paradigmatic relation.
And these two relations are in fact so
fundamental that they can be
generalized to capture basic relations
between units in arbitrary sequences.
And definitely they can be
generalized to describe
relations of any items in a language.
So, A and B don't have to be words and
they can be phrases, for example.
And they can even be more complex
phrases than just a noun phrase.
If you think about the general
problem of sequence mining,
then we can think about the units
being any units in the sequence data.
Then we think of paradigmatic
relation as relations that
are applied to units that tend to occur
in similar locations in a sentence,
or in a sequence of data
elements in general.
So they occur in similar locations
relative to the neighbors in the sequence.
Syntagmatic relation, on
the other hand, is related to
co-occurring elements that tend
to show up in the same sequence.
So these two are complementary and
are basic relations of words.
And we're interested in discovering
them automatically from text data.
Discovering such word
relations has many applications.
First, such relations can be directly
useful for improving accuracy of many NLP
tasks, and this is because this is part
of our knowledge about a language.
So if you know these two words
are synonyms, for example,
then that can help a lot of tasks.
And grammar learning can be also
done by using such techniques.
Because if we can learn
paradigmatic relations,
then we form classes of words,
syntactic classes for example.
And if we learn syntagmatic relations,
then we would be able to know
the rules for putting together a larger
expression based on component expressions.
So we learn the structure and
what can go with what else.
Word relations can be also very useful for
many applications in text retrieval and
mining.
For example, in search and
text retrieval, we can use word
associations to modify a query,
and this can be used to
introduce additional related words into
a query and make the query more effective.
This is often called query expansion.
Or you can use related words to
suggest related queries to the user
to explore the information space.
Another application is to
use word associations to
automatically construct a topic
map for browsing.
We can have words as nodes and
associations as edges.
A user could navigate from
one word to another to
find information in the information space.
Finally, such word associations can also
be used to compare and summarize opinions.
For example, we might be interested
in understanding positive and
negative opinions about the iPhone 6.
In order to do that, we can look at what
words are most strongly associated with
a feature word like battery in
positive versus negative reviews.
Such syntagmatic
relations would help us
show the detailed opinions
about the product.
So, how can we discover such
associations automatically?
Now, here are some intuitions
about how to do that.
Now let's first look at
the paradigmatic relation.
Here we essentially can take
advantage of similar context.
So here you see some simple
sentences about cat and dog.
You can see they generally
occur in similar context,
and that after all is the definition
of paradigmatic relation.
On the right side you can
see I have explicitly extracted
the context of cat and
dog from this small sample of text data.
I've taken away cat and
dog from these sentences, so
that you can see just the context.
Now, of course we can have different
perspectives to look at the context.
For example, we can look at
what words occur in the left
part of this context.
So we can call this left context.
What words occur before we see cat or dog?
So, you can see in this case, clearly
dog and cat have similar left context.
You generally say his cat or my cat and
you say also, my dog and his dog.
So that makes them similar
in the left context.
Similarly, if you look at the words
that occur after cat and dog,
which we can call right context,
they are also very similar in this case.
Of course, it's an extreme case,
where you only see eats.
And in general,
you'll see many other words, of course,
that can follow cat and dog.
You can also even look
at the general context.
And that might include all
the words in the sentence or
in sentences around this word.
And even in the general context, you also
see similarity between the two words.
So this was just a suggestion
that we can discover paradigmatic
relation by looking at
the similarity of context of words.
So, for example,
if we think about the following questions.
How similar are context of cat and
context of dog?
In contrast how similar are context
of cat and context of computer?
Now, intuitively,
we would imagine the context of cat and
the context of dog would
be more similar than
the context of cat and
context of the computer.
That means, in the first case
the similarity value would be high,
between the context of cat and
dog, whereas in the second,
the similarity between context of cat and
computer would be low
because they do not have a paradigmatic
relationship. Just imagine what words
occur after computer in general.
They would be very different from
what words occur after cat.
So this is the basic idea of
discovering paradigmatic relations.
What about the syntagmatic relation?
Well, here we're going to explore
the correlated occurrences,
again based on the definition
of syntagmatic relation.
Here you see the same sample of text.
But here we're interested in knowing
what other words are correlated
with the verb eats and
what words can go with eats.
And if you look at the right
side of this slide,
you see that
I've taken away the two words around eats.
I've taken away the word to its left and
also the word to its
right in each sentence.
And then we ask the question, what words
tend to occur to the left of eats?
And what words tend to
occur to the right of eats?
Now thinking about this question
would help us discover syntagmatic
relations because syntagmatic relations
essentially capture such correlations.
So the important question to ask for
syntagmatical relation is,
whenever eats occurs,
what other words also tend to occur?
So the question here has
to do with whether there
are some other words that tend
to co-occur together with eats.
Meaning that whenever you see eats
you tend to see the other words.
And if you don't see eats, probably,
you don't see other words often either.
So this intuition can help
discover syntagmatic relations.
Now again, consider an example.
How helpful is the occurrence of eats for
predicting the occurrence of meat?
Knowing whether eats occurs
in a sentence would generally help us
predict whether meat also occurs.
If we see eats occur in the sentence,
that should increase the chance
that meat would also occur.
In contrast,
if you look at the question in the bottom,
how helpful is the occurrence of eats for
predicting the occurrence of text?
Because eats and
text are not really related,
knowing whether eats occurred
in the sentence doesn't
really help us predict whether
text also occurs in the sentence.
So this is in contrast to
the question about eats and meat.
This also helps explain the intuition
behind the methods for
discovering syntagmatic relations.
Namely, we need to capture the correlation
between the occurrences of two words.
So to summarize the general ideas for
discovering word associations
are the following.
For paradigmatic relation,
we represent each word by its context,
and then compute the context similarity.
We're going to assume the words
that have high context similarity
to have paradigmatic relation.
For syntagmatic relation, we will count
how many times two words occur together
in a context, which can be a sentence,
a paragraph, or a document even.
And we're going to compare
their co-occurrences with
their individual occurrences.
We're going to assume words
with high co-occurrences but
relatively low individual occurrences
to have syntagmatic relations
because they tend to occur together and
they don't usually occur alone.
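
The idea of comparing co-occurrences with individual occurrences can be sketched in a few lines of Python. This is only an illustration, not the method the course later develops: the toy sentences, the function names, and the choice of a conditional probability as the comparison are all assumptions made here.

```python
from collections import Counter
from itertools import combinations

# Toy corpus (an assumption for illustration); each sentence is one context.
sentences = [
    "my cat eats fish on saturday",
    "his cat eats turkey on tuesday",
    "my dog eats meat on sunday",
    "his dog eats turkey on tuesday",
    "my computer runs programs on sunday",
    "his computer stores text on tuesday",
]

def cooccurrence_stats(sentences):
    """Count in how many sentences each word, and each unordered word pair, appears."""
    word_count = Counter()
    pair_count = Counter()
    for s in sentences:
        words = sorted(set(s.split()))          # presence/absence per sentence
        word_count.update(words)
        pair_count.update(combinations(words, 2))
    return word_count, pair_count

def conditional_prob(a, b, word_count, pair_count):
    """Estimate P(b occurs in a sentence | a occurs in that sentence)."""
    pair = tuple(sorted((a, b)))
    return pair_count[pair] / word_count[a]

word_count, pair_count = cooccurrence_stats(sentences)
n = len(sentences)

# How much does seeing "eats" change our belief about "meat" and about "text"?
p_meat = word_count["meat"] / n
p_meat_given_eats = conditional_prob("eats", "meat", word_count, pair_count)
p_text = word_count["text"] / n
p_text_given_eats = conditional_prob("eats", "text", word_count, pair_count)
```

In this toy sample, seeing eats raises the estimated chance of seeing meat, while it does not raise the chance of seeing text, which matches the intuition in the two questions above.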
Note that the paradigmatic relation and
the syntagmatic relation
are actually closely related
in that paradigmatically
related words tend to have syntagmatic
relation with the same word.
They tend to be associated
with the same word, and
that suggests that we can also do joint
discovery of the two relations.
So these general ideas can be
implemented in many different ways.
And the course won't cover all of them,
but
we will cover at least some of
the methods that are effective for
discovering these relations.
[MUSIC]

[SOUND] This lecture is about
Paradigmatic Relation Discovery.
In this lecture we are going to talk about
how to discover a particular kind of word
association called
a paradigmatical relation.
By definition,
two words are paradigmatically
related if they share a similar context.
Namely, they occur in
similar positions in text.
So naturally our idea of discovering such
a relation is to look at the context
of each word and then try to compute
the similarity of those contexts.
So here is an example of
context of a word, cat.
Here I have taken the word
cat out of the context and
you can see we are seeing some remaining
words in the sentences that contain cat.
Now, we can do the same thing for
another word like dog.
So in general we would like to capture
such a context and then try to assess
the similarity of the context of cat and
the context of a word like dog.
So now the question is how can we
formally represent the context and
then define the similarity function.
So first, we note that the context
actually contains a lot of words.
So, it can be regarded as
a pseudo document, an imaginary
document, but there are also different
ways of looking at the context.
For example, we can look at the word
that occurs before the word cat.
We can call this context Left1 context.
All right, so in this case you
will see words like my, his, or
big, a, the, et cetera.
These are the words that can
occur to left of the word cat.
So we say my cat, his cat,
big cat, a cat, et cetera.
Similarly, we can also collect the words
that occur right after the word cat.
We can call this context Right1, and
here we see words like eats,
ate, is, has, et cetera.
Or, more generally,
we can look at all the words in
the window of text around the word cat.
Here, let's say we can take a window
of 8 words around the word cat.
We call this context Window8.
Now, of course, you can see all
the words from left or from right, and
so we'll have a bag of words in
general to represent the context.
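
As a rough sketch of how these three views of the context could be collected, here is a small Python example. The function name, the whitespace tokenization, and the reading of "a window of 8 words" as 4 words on each side are assumptions made for illustration.

```python
from collections import Counter

def contexts(word, sentences, window=8):
    """Collect Left1, Right1 and Window bags of words for `word`.

    Assumption: a window of size `window` means window // 2 words on each side.
    """
    left1, right1, win = Counter(), Counter(), Counter()
    half = window // 2
    for s in sentences:
        tokens = s.lower().split()
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            if i > 0:
                left1[tokens[i - 1]] += 1            # word right before
            if i + 1 < len(tokens):
                right1[tokens[i + 1]] += 1           # word right after
            lo = max(0, i - half)
            # All neighbors in the window, with the target word itself removed.
            win.update(tokens[lo:i] + tokens[i + 1:i + 1 + half])
    return left1, right1, win

sentences = [
    "My cat eats fish on Saturday",
    "His cat eats turkey on Tuesday",
]
left1, right1, window8 = contexts("cat", sentences)
```

Each returned counter is a bag of words, so any of the three can be treated as a pseudo document for the word cat.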
Now, such a word based representation
would actually give us
an interesting way to define the
perspective of measuring the similarity.
Because if you look at just
the similarity of Left1,
then we'll see words that share
just the words in the left context,
and we kind of ignored the other words
that are also in the general context.
So that gives us one perspective to
measure the similarity, and similarly,
if we only use the Right1 context,
we will capture the similarity
from another perspective.
Using both the Left1 and
Right1 of course would allow us to capture
the similarity with even
more strict criteria.
So in general, context may contain
adjacent words, like eats and
my, that you see here, or
non-adjacent words, like Saturday,
Tuesday, or
some other words in the context.
And this flexibility also allows us
to match the similarity in somewhat
different ways.
Sometimes this is useful,
as we might want to capture
similarity based on general content.
That would give us loosely
related paradigmatical relations.
Whereas if you use only the words
immediately to the left and
to the right of the word, then you
likely will capture words that are very
much related by their syntactical
categories and semantics.
So the general idea of discovering
paradigmatical relations
is to compute the similarity
of context of two words.
So here, for example,
we can measure the similarity of cat and
dog based on the similarity
of their context.
In general, we can combine all
kinds of views of the context.
And so the similarity function is,
in general,
a combination of similarities
on different context.
And of course, we can also assign
weights to these different
similarities to allow us to focus
more on a particular kind of context.
And this would be naturally
application specific, but again,
here the main idea for discovering
paradigmatically related words is
to compute the similarity
of their context.
So next let's see how we exactly
compute these similarity functions.
Now to answer this question,
it is useful to think of bag of words
representation as vectors
in a vector space model.
Now those of you who are
familiar with information retrieval or
text retrieval techniques would
realize that the vector space model has
been used frequently for
modeling documents and queries for search.
But here we also find it convenient
to model the context of a word for
paradigmatic relation discovery.
So the idea of this
approach is to view each
word in our vocabulary as defining one
dimension in a high dimensional space.
So if we have N words in
total in the vocabulary,
then we have N dimensions,
as illustrated here.
And on the bottom, you can see a frequency
vector representing a context,
and here we see that eats
occurred 5 times in this context,
ate occurred 3 times, et cetera.
So this vector can then be placed
in this vector space model.
So in general,
we can represent a pseudo document or
context of cat as one vector,
d1, and another word,
dog, might give us a different context,
so d2.
And then we can measure
the similarity of these two vectors.
So by viewing context in
the vector space model,
we convert the problem of
paradigmatical relation discovery
into the problem of computing
the vectors and their similarity.
So the two questions that we
have to address are first,
how to compute each vector, and
that is how to compute xi or yi.
And the other question is how
do you compute the similarity.
Now in general, there are many approaches
that can be used to solve the problem, and
most of them are developed for
information retrieval.
And they have been shown to work well for
matching a query vector and
a document vector.
But we can adapt many of
the ideas to compute a similarity
of context documents for our purpose here.
So let's first look at
the one plausible approach,
where we try to match
the similarity of context based on
the expected overlap of words,
and we call this EOWC.
So the idea here is to represent
a context by a word vector
where each word has a weight
that's equal to the probability
that a randomly picked word from
this document vector, is this word.
So in other words,
xi is defined as the normalized
count of word wi in the context, and
this can be interpreted as
the probability that you would
actually pick this word from d1
if you randomly picked a word.
Now, of course these xi's would sum to one
because they are normalized frequencies,
and this means the vector is
actually a probability
distribution over words.
So, the vector d2 can be also
computed in the same way, and
this would give us then two probability
distributions representing two contexts.
So, that addresses the problem
how to compute the vectors, and
next let's see how we can define
similarity in this approach.
Well, here, we simply define
the similarity as a dot product of two
vectors, and
this is defined as a sum of the products
of the corresponding
elements of the two vectors.
Now, it's interesting to see
that this similarity function
actually has a nice interpretation,
and that is this.
The dot product in fact gives
us the probability that two
randomly picked words from
the two contexts are identical.
That means if we try to pick a word
from one context and try to pick another
word from another context, we can then
ask the question, are they identical?
If the two contexts are very similar,
then we should expect we frequently will
see the two words picked from
the two contexts are identical.
If they are very different,
then the chance of seeing
identical words being picked from
the two contexts would be small.
So this intuitively makes sense, right,
for measuring similarity of contexts.
Now you might want to also take
a look at the exact formulas and
see why this can be interpreted
as the probability that
two randomly picked words are identical.
So if you just stare at the formula
to check what's inside this sum,
then you will see basically in each
case it gives us the probability that
we will see an overlap on
a particular word, wi.
And where xi gives us a probability that
we will pick this particular word from d1,
and yi gives us the probability
of picking this word from d2.
And when we pick the same
word from the two contexts,
then we have an identical pick.
So that's one possible approach, EOWC,
expected overlap of words in context.
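
A minimal Python sketch of EOWC as described above: normalize the counts in each context into a probability distribution, then take the dot product. The function names and the toy counts for dog and computer are assumptions for illustration; the counts for cat reuse the eats 5, ate 3 example mentioned earlier.

```python
from collections import Counter

def to_distribution(counts):
    """Turn raw word counts into normalized frequencies that sum to 1."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def eowc(counts1, counts2):
    """Expected overlap of words in context: the dot product of two distributions.

    This equals the probability that one word picked at random from each
    context turns out to be the same word.
    """
    d1 = to_distribution(counts1)
    d2 = to_distribution(counts2)
    return sum(p * d2[w] for w, p in d1.items() if w in d2)

# Toy context counts (assumed values).
cat_context = Counter({"eats": 5, "ate": 3, "is": 2})
dog_context = Counter({"eats": 4, "ate": 2, "is": 3, "barks": 1})
computer_context = Counter({"is": 3, "runs": 4, "has": 3})

sim_cat_dog = eowc(cat_context, dog_context)
sim_cat_computer = eowc(cat_context, computer_context)
```

With these toy counts, the contexts of cat and dog overlap on three words while cat and computer overlap only on is, so the first similarity comes out higher.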
Now as always, we would like to assess
whether this approach would work well.
Now of course, ultimately we have to
test the approach with real data and
see if it really gives us
semantically related words,
that is, paradigmatic relations.
But
analytically we can also analyze
this formula a little bit.
So first, as I said,
it does make sense, right, because this
formula will give a higher score if there
is more overlap between the two contexts.
So that's exactly what we want.
But if you analyze
the formula more carefully,
then you also see there might
be some potential problems,
and specifically there
are two potential problems.
First, it might favor matching
one frequent term very well,
over matching more distinct terms.
And that is because in the dot product,
if one element has a high value and this
element is shared by both contexts and
it contributes a lot to the overall sum,
it might indeed make the score
higher than in another case,
where the two vectors actually have
a lot of overlap in different terms.
But each term has a relatively low
frequency, so this may not be desirable.
Of course, this might be
desirable in some other cases.
But in our case, we should intuitively
prefer a case where we match
more different terms in the context,
so that we have more confidence
in saying that the two words
indeed occur in similar context.
If you only rely on one term,
that's a little bit questionable;
it may not be robust.
Now the second problem is that it
treats every word equally, right.
So if you match a word like the,
it will be the same as
matching a word like eats. But
intuitively we know
matching the isn't really
surprising because the occurs everywhere.
So matching the is not as
strong evidence as matching
a word like eats,
which doesn't occur frequently.
So this is another
problem of this approach.
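
The first problem can be seen with a tiny numeric example in Python. The two pairs of distributions below are made-up values chosen only to illustrate the effect.

```python
def dot(d1, d2):
    """Dot product of two sparse word distributions stored as dicts."""
    return sum(p * d2.get(w, 0.0) for w, p in d1.items())

# Pair A: the contexts share only one very frequent word, "the".
a1 = {"the": 0.9, "eats": 0.1}
a2 = {"the": 0.9, "runs": 0.1}

# Pair B: the contexts share four distinct words, each with modest frequency.
b1 = {"eats": 0.25, "ate": 0.25, "feeds": 0.25, "sleeps": 0.25}
b2 = dict(b1)

score_a = dot(a1, a2)   # dominated by the single match on "the"
score_b = dot(b1, b2)   # spread over four matched words
```

Pair A gets the higher score even though it matches on a single uninformative word, while pair B, which matches on every word, gets a lower one. This is exactly the behavior we would like to avoid.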
In the next lecture we are going to talk
about how to address these problems.
[MUSIC]

[SOUND] In this lecture
we continue discussing
Paradigmatic Relation Discovery.
Earlier we introduced a method called
Expected Overlap of Words in Context.
In this method we represent each
context by a word vector
that represents the probability
of a word in the context.
And we measure the similarity by using the
dot product which can be interpreted as
the probability that two randomly picked
words from the two contexts are identical.
We also discussed the two
problems of this method.
The first is that it favors
matching one frequent term
very well over matching
more distinct terms.
It puts too much emphasis on
matching one term very well.
The second is that it
treats every word equally.
Even a common word like
the would contribute equally
as a content word like eats.
So now we are going to talk about
how to solve these problems.
More specifically we're going to
introduce some retrieval heuristics
used in text retrieval. These
heuristics can effectively solve these
problems, as these problems also occur
in text retrieval when we match a query
with a document.
To address the first problem,
we can use a sublinear
transformation of term frequency.
That is, we don't have to use the raw
frequency count of the term to represent
the context.
We can transform it into some form that
wouldn't put so much emphasis on the raw
frequency.
To address the second problem,
we can put more weight on rare terms.
That is,
we can reward matching a rare word.
And this heuristic is called IDF
term weighting in text retrieval.
IDF stands for inverse document frequency.
So now we're going to talk about
the two heuristics in more detail.
First, let's talk about
the TF transformation.
That is, to convert the raw count of
a word in the document into some weight
that reflects our belief about
how important this word is in
the document.
And so,
that would be denoted by TF of w and d.
That's shown on the Y axis.
Now, in general,
there are many ways to map that.
And let's first look at
the simple way of mapping.
In this case, we're going to say, well,
any non zero counts will be mapped to one.
And the zero count will be mapped to zero.
So with this mapping, all the frequencies
will be mapped to only two values,
zero or one.
And the mapping function is
shown here as a flat line.
This is naive because it ignores
the frequency of words. However,
this actually has the
advantage of emphasizing
matching all the words in the context.
It does not allow a frequent
word to dominate the match.
Now, the approach that we have taken earlier
in the expected overlap approach
is a linear transformation. We
basically take y as the same as x, so
we use the raw count as
a representation, and
that created the problem
that we just talked about.
Namely, it emphasizes too much
on matching one frequent term.
Matching one frequent term
can contribute a lot.
We can have a lot of other interesting
transformations in between
the two extremes.
And they generally form
a sublinear transformation.
So for example,
one is the logarithm of the raw count.
And this will give us a curve that looks
like what you are seeing here.
In this case,
you can see the high frequency counts,
the high counts, are penalized
a little bit,
so the curve is a sublinear curve.
And it brings down the weight
of those really high counts.
And this is what we want because it prevents
those kinds of terms from
dominating the scoring function.
Now, there is also another interesting
transformation called a BM25
transformation, which has been shown
to be very effective for retrieval.
And in this transformation we
have a form that looks like this.
So it's k plus one multiplied by x,
divided by x plus k,
where k is a parameter
and x is the count,
the raw count of a word.
Now the transformation is very
interesting, in that it can actually
kind of go from one extreme to
the other extreme by varying k,
and it also is interesting that it
has an upper bound, k + 1 in this case.
So, this puts a very strict
constraint on high frequency terms,
because their weight
will never exceed k + 1.
As we vary k,
we can simulate the two extremes.
So, when k is set to zero,
we roughly have the zero-one vector.
Whereas, when we set k
to a very large value,
it will behave more like
the linear transformation.
So this transformation function is by far
the most effective transformation function
for text retrieval, and it also
makes sense for our problem set up.
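
The four transformations discussed here can be written down directly in Python. This is a sketch: the function names are made up, the logarithm is taken as log(1 + x) so that a zero count maps to zero (an assumption), and k = 1.2 in the example is just a commonly used illustrative value.

```python
import math

def tf_binary(x):
    """0/1 mapping: any nonzero count becomes 1."""
    return 1.0 if x > 0 else 0.0

def tf_linear(x):
    """Linear mapping: use the raw count as is (y = x)."""
    return float(x)

def tf_log(x):
    """Sublinear mapping with a logarithm; log(1 + x) keeps tf_log(0) == 0."""
    return math.log(1 + x)

def tf_bm25(x, k):
    """BM25 TF transformation (k + 1) * x / (x + k), bounded above by k + 1."""
    if x == 0:
        return 0.0   # also avoids 0 / 0 when k == 0
    return (k + 1) * x / (x + k)

# Compare how each mapping treats growing counts.
for count in (0, 1, 5, 50):
    print(count, tf_binary(count), tf_linear(count),
          round(tf_log(count), 3), round(tf_bm25(count, k=1.2), 3))
```

Setting k to 0 in tf_bm25 reproduces the 0/1 mapping, and a very large k makes it behave almost like the raw count, which are the two extremes described above.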
So we just talked about how to solve the
problem of overemphasizing
a frequent term.
Now let's look at the second problem, and
that is how we can penalize popular terms.
Matching the is not surprising
because the occurs everywhere.
But matching eats would count a lot. So
how can we address that problem?
In this case we can use the IDF weighting
that's commonly used in retrieval.
IDF stands for inverse document frequency.
Document frequency means the count of
the total number of documents
that contain a particular word.
So here we show that the IDF measure
is defined as a logarithm function
of the number of documents that
match a term or document frequency.
So, k is the number of documents
containing a word, or document frequency.
And M here is the total number
of documents in the collection.
The IDF function is
giving a higher value for
a lower k,
meaning that it rewards a rare term, and
the maximum value is log of M+1.
That's when the word occurred in just
one document, so that's a very rare term,
the rarest term in the whole collection.
The lowest value you can see here is when
K reaches its maximum, which would be M.
All right so,
that would be a very low value,
close to zero in fact.
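
The IDF formula itself is not read out in the lecture, but the values quoted above (a maximum of log(M+1) when k is 1, and a value close to zero when k equals M) are consistent with IDF(w) = log((M + 1) / k). The Python sketch below assumes that form; the function and parameter names are made up for illustration.

```python
import math

def idf(doc_freq, num_docs):
    """Inverse document frequency, assumed to be log((M + 1) / k).

    doc_freq: k, the number of documents that contain the word (k >= 1).
    num_docs: M, the total number of documents in the collection.
    """
    return math.log((num_docs + 1) / doc_freq)

M = 1000
rare = idf(1, M)       # word found in a single document: the maximum, log(M + 1)
common = idf(M, M)     # word found in every document: close to zero
```

A word like the would have a document frequency near M and so a weight near zero, while a word like eats would get a much larger weight.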
So, of course, this measure
is used in search,
where we naturally have a collection.
In our case, what would be our collection?
Well, we can also use the contexts
that we had collected for
all the words as our collection.
And that is to say, a word that's
popular in the collection in general
would also have a low IDF.
Depending
on the dataset, we can construct
the context vectors in different ways.
But in the end, if a term is very
frequent in the original data set,
then it will still be frequent in
the collected context documents.
So how can we add these
heuristics to improve our
similarity function? Well, here's one way,
and there are many other
ways that are possible.
But this is a reasonable way,
where we can adapt the BM25
retrieval model for
paradigmatic relation mining.
So here
we define
the document vector as
containing elements representing
normalized BM25 values.
So in this normalization function,
we take a sum over all the words,
and we normalize the weight
of each word by the sum of
the weights of all the words.
And this is to, again, ensure all
the xi's will sum to 1 in this vector.
So this would be very similar
to what we had before,
in that this vector is actually something
similar to a word distribution,
as all the xi's will sum to 1.
Now the weight of BM25 for
each word is defined here.
And if you compare this with our old
definition where we just had a normalized
count, there
we only had this count and
the document length, or
the total count of words
in that context document.
That's what we had before.
But now with the BM25 transformation,
we have introduced something else.
First, of course, this extra occurrence
of the count is just to achieve
the sublinear normalization.
But we also see we introduced
the parameter k here.
And this parameter is generally a non-negative
number, although zero is also possible.
This controls the upper bound and
also
to what extent it simulates
the linear transformation.
And so this is one parameter, but we also
see there is another parameter here, b.
And this would be between 0 and 1.
And this is a parameter to
control length normalization.
And in this case, the normalization
formula has average document length here.
And this is computed by
taking the average of
the lengths of all the documents
in the collection.
In this case, all the lengths
of all the context documents.
That we are considering.
So this average document length will be
a constant for any given collection.
So it actually is only
affecting the effect of
the parameter b here,
because this is a constant.
But I kept it here because it's
useful in retrieval, where it would give us a
stabilized interpretation of parameter b.
But, for
our purpose it would be a constant.
So it would only be affecting the length
normalization together with parameter b.
Now with this definition then, we have a
new way to define our document vectors.
And we can compute the vector
d2 in the same way.
The difference is that the high
frequency terms will now have a somewhat
lower weight.
And this would help us control the
influence of these high frequency terms.
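The normalized BM25 vector described above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the slide formula is not reproduced in the transcript, so the code uses the commonly cited BM25 term-frequency form, and the function name `bm25_vector`, the toy counts, and the values of k and b are all illustrative.

```python
from collections import Counter

def bm25_vector(context_counts, avdl, k=1.2, b=0.75):
    """Return normalized BM25 weights x_i for one context pseudo-document.

    Assumed form: BM25(w) = (k + 1) * c / (c + k * (1 - b + b * |d| / avdl)),
    followed by dividing each weight by the sum of all weights.
    """
    doc_len = sum(context_counts.values())
    # Length normalizer; equals 1 when doc_len == avdl or when b == 0.
    norm = 1.0 - b + b * doc_len / avdl
    raw = {w: (k + 1) * c / (c + k * norm) for w, c in context_counts.items()}
    total = sum(raw.values())
    # Normalize so that the weights sum to 1, like a word distribution.
    return {w: v / total for w, v in raw.items()}

# Hypothetical context of a candidate word: counts of the surrounding words.
d1 = Counter({"the": 10, "eats": 2, "fish": 1})
x = bm25_vector(d1, avdl=13.0)
```

Compared with the plain normalized count (10/13 for "the" in this toy context), the BM25 weight of the high-frequency word is noticeably lower, which is the effect we want.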
Now, IDF can be added
here in the scoring function.
That means we will introduce a weight for
matching each term.
You may recall, this sum ranges over
all the possible words that can
overlap between the two contexts.
And xi and yi are the probabilities
of picking the word from each context.
Therefore, their product
indicates how likely we'll
see a match on this word.
Now, IDF would give us the importance
of matching this word.
A common word will be worth
less than a rare word, and so
we emphasize more on
matching rare words now.
So, with this modification,
the new function will
likely address those two problems.
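As a rough sketch, the IDF-weighted scoring function can be written as a sum, over the words shared by the two contexts, of IDF(w) times x_w times y_w. The function name, the toy vectors, and the IDF values below are made up for illustration, and treating a missing IDF entry as zero is an assumption of this sketch.

```python
def idf_weighted_similarity(x, y, idf):
    """Sum over shared words of IDF(w) * x_w * y_w.

    x and y are normalized context vectors (word -> weight).
    A word with no IDF entry contributes nothing (assumption of this sketch).
    """
    shared = x.keys() & y.keys()
    return sum(idf.get(w, 0.0) * x[w] * y[w] for w in shared)

# Toy vectors and IDF values, chosen only for illustration.
x = {"the": 0.5, "eats": 0.3, "fish": 0.2}
y = {"the": 0.6, "eats": 0.4}
idf = {"the": 0.1, "eats": 2.0, "fish": 3.0}
score = idf_weighted_similarity(x, y, idf)
```

Here the match on the rare word "eats" contributes 0.24 to the score, while the match on the common word "the" contributes only 0.03, even though "the" has larger weights in both vectors.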
Now interestingly,
we can also use this approach to
discover syntagmatic relations.
In general,
when we represent
a context with a term
vector, we would likely see that
some terms have higher weights, and
other terms have lower weights.
Depending on how we assign
weights to these terms,
we might be able to use
these weights to discover
the words that are strongly associated
with a candidate word in the context.
It's interesting that we can
also use this context
similarity function based on BM25
to discover syntagmatic relations.
So, the idea is to use the weighted
representation of the context
to see which terms are scored high.
And if a term has high weight,
then that term might be more strongly
related to the candidate word.
So let's take a look at
the vector in more detail here.
And we have
each Xi defined as
a normalized weight of BM25.
Now this weight alone only reflects how
frequently the word occurs in the context.
But we can't just say any frequent
term in the context would be
correlated with the candidate word,
because many common words like "the" will
occur frequently in all the contexts.
But if we apply IDF
weighting as you see here,
we can then re-weight
these terms based on IDF.
That means the words that are common,
like "the", will get penalized.
So now the highest weighted terms will not
be those common terms because they have
lower IDFs.
Instead, those terms would be the terms
that are frequent in the context but
not frequent in the collection.
So those are clearly the words
that tend to occur in the context
of the candidate word, for example, cat.
So, for this reason, the highly weighted
terms in this IDF-weighted vector
can also be assumed to be candidates for
syntagmatic relations.
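To make this concrete, here is a small sketch of ranking the terms of a context vector by IDF-weighted value. The IDF formula, the document frequencies, and the context weights are all assumptions for illustration; the lecture does not fix a particular IDF formula at this point.

```python
import math

def idf(doc_freq, num_docs):
    # One common IDF variant (an assumption, not the lecture's definition).
    return math.log((num_docs + 1) / doc_freq)

def syntagmatic_candidates(x, doc_freq, num_docs, top_n=3):
    """Rank the terms of a context vector x by IDF(w) * x_w."""
    weighted = {w: idf(doc_freq[w], num_docs) * v for w, v in x.items()}
    return sorted(weighted, key=weighted.get, reverse=True)[:top_n]

# Hypothetical normalized context vector for the candidate word "cat".
x_cat = {"the": 0.45, "eats": 0.30, "fish": 0.25}
# Hypothetical document frequencies in a collection of 1000 documents.
doc_freq = {"the": 1000, "eats": 40, "fish": 25}
top = syntagmatic_candidates(x_cat, doc_freq, num_docs=1000, top_n=2)
```

Even though "the" has the largest weight in the context, its IDF is close to zero, so it drops out of the top candidates.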
Now, of course, this is only
a byproduct of our approach for
discovering paradigmatic relations.
And in the next lecture,
we're going to talk more about how
to discover syntagmatic relations.
But it clearly shows the relation
between discovering the two relations.
And indeed they can be
discovered in a joint
manner by leveraging
such associations.
This also shows some interesting
connections between the discovery of
syntagmatic relations and
paradigmatic relations.
Specifically, those words that
are paradigmatically related tend to
have a syntagmatic
relation with the same word.
So to summarize, the main idea of
discovering paradigmatic relations
is to collect the context of a candidate
word to form a pseudo document,
and this is typically
represented as a bag of words.
And then compute similarity of
the corresponding context documents
of two candidate words.
And then we can take the highly
similar word pairs and
treat them as having
paradigmatic relations.
These are the words that
share similar contexts.
There are many different ways
to implement this general idea,
and we just talked about
some of the approaches, and
more specifically we talked about
using text retrieval models to help
us design effective similarity function
to compute the paradigmatic relations.
More specifically we
have used the BM25 and
IDF weighting to discover
paradigmatic relation.
And these approaches also
represent the state of the art
in text retrieval techniques.
Finally, syntagmatic relations
can also be discovered as a
byproduct when we discover
paradigmatic relations.
[MUSIC]

[SOUND]
>> This
lecture is about topic mining and
analysis.
We're going to talk about its
motivation and task definition.
In this lecture we're going to talk
about a different kind of mining task.
As you see on this road map,
we have just covered
mining knowledge about language,
namely discovery of
word associations such as paradigmatic
relations and syntagmatic relations.
Now, starting from this lecture, we're
going to talk about mining another kind of
knowledge, which is content mining, and
trying to discover knowledge about
the main topics in the text.
And we call that topic mining and
analysis.
In this lecture, we're going to talk about
its motivation and the task definition.
So first of all,
let's look at the concept of topic.
So topic is something that we
all understand, I think, but
it's actually not that
easy to formally define.
Roughly speaking, topic is the main
idea discussed in text data.
And you can think of this as a theme or
subject of a discussion or conversation.
It can also have different granularities.
For example,
we can talk about the topic of a sentence,
the topic of an article,
the topic of a paragraph, or
the topic of all the research articles
in a research library.
So different granularities of topics
obviously have different applications.
Indeed, there are many applications that
require discovery of topics in text and
analysis of them.
Here are some examples.
For example, we might be interested
in knowing what Twitter
users are talking about today.
Are they talking about NBA sports, or
are they talking about some
international events, etc.?
Or we are interested in
knowing about research topics.
For example, one might be interested in
knowing what are the current research
topics in data mining, and how are they
different from those five years ago?
Now this involves discovery of topics
in the data mining literature, and
also we want to discover topics in
today's literature and those in the past.
And then we can make a comparison.
We might also be interested in
knowing what people like about
some products like the iPhone 6,
and what do they dislike?
And this involves discovering
topics in positive opinions about
iPhone 6 and
also negative reviews about it.
Or perhaps we're interested in knowing
what the major topics debated in the 2012
presidential election were.
And all these have to do with discovering
topics in text and analyzing them,
and we're going to talk about a lot
of techniques for doing this.
In general we can view a topic as
some knowledge about the world.
So from text data we expect to
discover a number of topics, and
then these topics generally provide
a description about the world.
And it tells us something about the world.
About a product, about a person etc.
Now when we have some non-text data,
then we can have more context for
analyzing the topics.
For example, we might know the time
associated with the text data, or
locations where the text
data were produced,
or the authors of the text, or
the sources of the text, etc.
All such metadata, or
context variables, can be associated
with the topics that we discover, and
then we can use these context variables
to help us analyze patterns of topics.
For example, looking at topics over time,
we would be able to discover
whether there's a trending topic, or
some topics might be fading away.
Similarly, by looking at topics
in different locations,
we might gain some insights about
people's opinions in different locations.
So that's why mining
topics is very important.
Now, let's look at the tasks
of topic mining and analysis.
In general, it would involve first
discovering a lot of topics, in this case,
k topics.
And then we also would like to know, which
topics are covered in which documents,
to what extent.
So for example, in document one, we
might see that Topic 1 is covered a lot,
Topic 2 and
Topic k are covered with a small portion.
And other topics,
perhaps, are not covered.
Document two, on the other hand,
covered Topic 2 very well,
but it did not cover Topic 1 at all, and
it also covers Topic k to some extent,
etc., right?
So now you can see there
are generally two different tasks, or
sub-tasks. The first is to discover k
topics from a collection of text data.
What are these k topics,
that is, the k major topics in the text data?
The second task is to figure out
which documents cover which topics
to what extent.
So more formally,
we can define the problem as follows.
First, we have, as input,
a collection of N text documents.
Here we can denote the text
collection as C, and
denote text article as d i.
And, we generally also need to have
as input the number of topics, k.
But there may be techniques that can
automatically suggest a number of topics.
But in the techniques that we will
discuss, which are also the most useful
techniques, we often need to
specify a number of topics.
Now the output would then be the k
topics that we would like to discover,
denoted as theta sub
one through theta sub k.
Also we want to generate the coverage of
topics in each document d sub i.
And this is denoted by pi sub ij.
And pi sub ij is the probability
of document d sub i
covering topic theta sub j.
So obviously for each document, we have
a set of such values to indicate to
what extent the document covers
each topic.
And we can assume that these
probabilities sum to one.
Because a document won't be able to cover
other topics outside of the topics
that we discussed, that we discovered.
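One way to picture the coverage output is as an N by k table of pi values, where each row belongs to one document and sums to one. The numbers below are hypothetical, and the topics themselves are left abstract because theta has not been defined yet.

```python
# Hypothetical coverage output for N = 2 documents and k = 3 topics.
# coverage[i][j] plays the role of pi_ij: how much document i covers topic j.
coverage = [
    [0.7, 0.2, 0.1],  # document 1 mostly covers topic 1
    [0.0, 0.8, 0.2],  # document 2 does not cover topic 1 at all
]

def is_valid_coverage(pi, tol=1e-9):
    """Check that each document's coverage is a distribution over the k topics."""
    return all(
        all(p >= 0 for p in row) and abs(sum(row) - 1.0) < tol
        for row in pi
    )

valid = is_valid_coverage(coverage)
```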
So now, the question is, how do we define
theta sub i, how do we define the topic?
Now this problem has not
been completely defined
until we define what is exactly theta.
So in the next few lectures,
we're going to talk about
different ways to define theta.
[MUSIC]

[MUSIC]
This lecture is about the expectation
maximization algorithms or
also called the EM algorithms.
In this lecture,
we're going to continue the discussion
of probabilistic topic models.
In particular,
we're going to introduce the EM algorithm.
Which is a family of useful algorithms for
computing the maximum likelihood
estimate of mixture models.
So, this is now a familiar scenario
of using a two-component mixture
model to try to factor out the background
words from one topic word distribution.
So, we're interested in computing
this estimate and
we're going to try to adjust these
probability values to maximize
the probability of the observed documents.
And note that we assume all
the other parameters are known.
So, the only thing unknown is these word
probabilities, given by theta sub d.
And in this lecture, we're going to look
into how to compute this maximum likelihood
estimate.
Now this started with the idea of
separating the words in
the text data into two groups.
One group will be explained
by the background model.
The other group will be explained
by the unknown topic word distribution.
After all this is the basic
idea of the mixture model.
But, suppose we actually know which
word is from which distribution.
So that would mean, for example,
these words, "the", "is", and
"we", are known to be from this
background word distribution.
On the other hand,
the other words, "text", "mining",
"clustering", etc., are known to be
from the topic word distribution.
If you can see the color,
these are shown in blue.
These blue words are assumed
to be from the topic word distribution.
If we already know how
to separate these words.
Then the problem of estimating
the word distribution
would be extremely simple, right?
If you think about this for
a moment, you'll realize that, well,
we can simply take all these
words that are known to be from
this word distribution,
theta sub d, and normalize them.
So indeed this problem would be
very easy to solve if we had known
which words are from which
distribution precisely.
And this, in fact,
makes this model no longer a mixture
model, because we can already observe which
distribution has been used
to generate which part of the data.
So we actually go back to the single
word distribution problem.
And in this case, let's call these words
that are known to be from theta d,
a pseudo document of d prime.
And now all we have to do is
just normalize these word
counts for each word, w sub i.
And that's fairly straightforward,
and it's just dictated by
the maximum likelihood estimator.
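If the word assignments were known, the estimate would just be a normalization of counts, as in this small sketch. The pseudo document below is a made-up example.

```python
from collections import Counter

def ml_estimate(words):
    """Maximum likelihood estimate of a word distribution: normalized counts."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Hypothetical pseudo document d': the words known to come from theta_d.
d_prime = ["text", "mining", "text", "clustering"]
p = ml_estimate(d_prime)
```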
Now, this idea, however,
doesn't work because we in practice,
don't really know which word
is from which distribution.
But this gives us an idea of perhaps
we can guess which word is
from which distribution.
Specifically, given all the parameters,
can we infer the distribution
a word is from?
So let's assume that we actually
know tentative probabilities for
these words in theta sub d.
So now all the parameters are known for
this mixture model.
Now let's consider a word, like "text".
So the question is,
do you think text is more likely,
having been generated from theta sub d or
from theta sub b?
So, in other words,
we are to infer which distribution
has been used to generate this text.
Now, this inference process is a typical
Bayesian inference situation,
where we have some prior about
these two distributions.
So can you see what is our prior here?
Well, the prior here is the probability
of each distribution, right.
So the prior is given by
these two probabilities.
In this case, the prior is saying
that each model is equally likely.
But we can imagine perhaps
a different prior is possible.
So this is called a prior
because this is our guess
of which distribution has been
used to generate the word
before we even observe the word.
So that's why we call it a prior.
If we don't observe the word we don't
know what word has been observed.
Our best guess is to say,
well, they're equally likely.
So it's just like flipping a coin.
Now in Bayesian inference,
we typically update our belief
after we have observed the evidence.
So what is the evidence here?
Well, the evidence here is the word text.
Now that we know we're
interested in the word text.
So text can be regarded as evidence.
And if we use Bayes'
rule to combine the prior and
the data likelihood,
what we will end up with
is to combine the prior
with the likelihood that you see here.
Which is basically the probability of
the word text from each distribution.
And we see that in both
cases text is possible.
Note that even in the background
it is still possible,
it just has a very small probability.
So intuitively what would be
your guess in this case?
Now if you're like many others,
you would guess "text" is probably
from theta sub d; it's more likely from theta sub d.
Why?
And you will probably see
that it's because "text" has
a much higher probability
here by theta sub d than
by the background model, which
has a very small probability.
And by this we're going to say well,
text is more likely from theta sub d.
So you see our guess of which
distribution has been used to
generate "text" would depend on
how high the probability of the data,
the text, is in each word distribution.
We tend to guess the
distribution that gives the word
a higher probability.
And this is likely to
maximize the likelihood.
All right, so we are going to choose
the distribution that gives the word a higher likelihood.
So, in other words we are going to
compare these two probabilities
of the word given by each
of these distributions.
But our guess must also
be affected by the prior.
So we also need to
compare these two priors.
Why?
Because imagine if we
adjust these probabilities.
We're going to say,
the probability of choosing
a background model is almost 100%.
Now if we have that kind of strong prior,
then that would affect your guess.
You might think,
well, wait a moment, maybe "text" could
have been from the background as well.
Although the probability is very
small here, the prior is very high.
So in the end, we have to combine the two.
And Bayes' formula
provides us a solid and
principled way of making this
kind of guess and quantifying it.
So more specifically, let's think about
the probability that this word text
has been generated in
fact from theta sub d.
Well, in order for text to be generated
from theta sub d, two things must happen.
First, the theta sub d
must have been selected.
So, we have the selection
probability here.
And secondly we also have to actually have
observed the text from the distribution.
So, when we multiply the two together,
we get the probability
that "text" has in fact been
generated from theta sub d.
Similarly, for the background model and
the probability of generating text
is another product of similar form.
Now we also introduce the latent
variable z here to denote
whether the word is from the background or
the topic.
When z is 0, it means it's from the topic,
theta sub d.
When it's 1, it means it's from
the background, theta sub B.
So now we have the probability
that text is generated from each,
then we can simply normalize
them to have estimate
of the probability that
the word text is from
theta sub d or from theta sub B.
And equivalently the probability
that Z is equal to zero,
given that the observed evidence is text.
So this is an application of Bayes' rule.
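The inference just described can be sketched as follows: multiply each prior by the word's probability under that distribution, then normalize. The probability values and the function name are made up for illustration; only the form of the computation follows the discussion above.

```python
def posterior_topic(word, p_topic, p_background, prior_topic=0.5):
    """p(z=0 | word): probability that the word came from theta_d, not theta_B.

    Assumes the word has nonzero probability under at least one distribution.
    """
    joint_topic = prior_topic * p_topic.get(word, 0.0)
    joint_background = (1.0 - prior_topic) * p_background.get(word, 0.0)
    return joint_topic / (joint_topic + joint_background)

# Illustrative (hypothetical) word probabilities under each distribution.
theta_d = {"text": 0.04, "the": 0.001}
theta_B = {"text": 0.0001, "the": 0.03}

equal_prior = posterior_topic("text", theta_d, theta_B, prior_topic=0.5)
strong_background_prior = posterior_topic("text", theta_d, theta_B, prior_topic=0.001)
```

With equal priors, "text" is almost certainly attributed to the topic. With a prior that heavily favors the background, the same word becomes more likely to be from the background, which is the effect of the prior discussed above.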
But this step is very crucial for
understanding the EM algorithm.
Because if we can do this,
then we would be able to first,
initialize the parameter
values somewhat randomly.
And then, we're going to take
a guess of these z values, that is,
which distribution has been
used to generate which word.
And the initialized parameter values
would allow us to have a complete
specification of the mixture model,
which allows us to apply Bayes'
rule to infer which distribution is
more likely to generate each word.
And this prediction essentially helps us
to separate words from
the two distributions.
Although we can't separate them for sure,
we can separate them
probabilistically as shown here.
[MUSIC]

[SOUND]
So
this is indeed a general idea of
the Expectation-Maximization, or EM,
Algorithm.
So in all the EM algorithms we
introduce a hidden variable
to help us solve the problem more easily.
In our case the hidden variable
is a binary variable for
each occurrence of a word.
And this binary variable would
indicate whether the word has
been generated from theta sub d or theta sub B.
And here we show some possible
values of these variables.
For example, "the" is from the background,
so the z value is one.
And "text", on the other hand,
is from the topic, so it's zero for
z, etc.
Now, of course, we don't observe these z
values; we just imagine there are such
values of z attached to all the words.
And that's why we call
these hidden variables.
Now, the idea that we
talked about before for
predicting the word distribution that
has been used to generate the word
is to predict
the value of this hidden variable.
And, so, the EM algorithm then,
would work as follows.
First, we'll initialize all
the parameters with random values.
In our case,
the parameters are mainly the probability
of a word, given by theta sub d.
So this is the initialization stage.
These initialized values would allow
us to use Bayes' rule to take a guess
of these z values, so
we'd guess these values.
We can't say for sure whether
"text" is from the background or not.
But we can have our guess.
This is given by this formula.
It's called an E-step.
And so the algorithm would then try to
use the E-step to guess these z values.
After that, it would then invoke
another step, called the M-step.
In this step we simply take advantage
of the inferred z values and
then just group words that are from
the same distribution, like these
from the background, including this one as well.
We can then normalize the count
to estimate the probabilities or
to revise our estimate of the parameters.
So let me also illustrate
that we can group the words
that are believed to have
come from theta sub d, and
that's "text", "mining", "algorithm",
for example, and "clustering".
And we group them together to help us
re-estimate the parameters
that we're interested in.
So these will help us
estimate these parameters.
Note that before we just set
these parameter values randomly.
But with this guess, we will have
a somewhat improved estimate of this.
Of course, we don't know exactly
whether it's zero or one.
So we're not going to really
do the split in a hard way.
But rather we're going to
do a softer split.
And this is what happened here.
So we're going to adjust the count by
the probability that we believe
this word has been generated
by using theta sub d.
And you can see this,
where does this come from?
Well, this has come from here, right?
From the E-step.
So the EM algorithm would
iteratively improve our initial
estimate of parameters by using
the E-step first and then the M-step.
The E-step is to augment the data
with additional information, like z.
And the M-step is to take advantage
of the additional information
to separate the data,
to split the data counts, and
then collect the right data counts to
re-estimate our parameters.
And then once we have a new generation of
parameters, we're going to repeat this.
We are going to do the E-step again.
To improve our estimate
of the hidden variables.
And then that would lead to another
generation of re-estimated parameters.
For the word distribution
that we are interested in.
Okay, so, as I said,
the bridge between the two
is really the variable z, the hidden variable,
which indicates how likely
this word is from the topic word
distribution, theta sub d.
So, this slide has a lot of content and
you may need to
pause the video to digest it.
But this basically captures
the essence of EM Algorithm.
Start with initial values that
are often random themselves.
And then we invoke the E-step followed
by the M-step to get an improved
setting of parameters.
And then we repeat this. So
this is a hill-climbing algorithm
that would gradually improve
the estimate of parameters.
As I will explain later
there is some guarantee for
reaching a local maximum of
the log-likelihood function.
So let's take a look at the computation for
a specific case. So
these formulas are the EM
formulas that you saw before, and
you can also see there are superscripts
here, like n,
to indicate the generation of parameters.
Like here, for example, we have n plus one.
That means we have improved.
From here to here we have an improvement.
So in this setting we have assumed the two
models have equal probabilities and
the background model is known.
So what are the relevant
statistics?
Well these are the word counts.
So assume we have just four words,
and their counts are like this.
And this is our background model that
assigns high probabilities to common
words like the.
And in the first iteration,
you can picture what will happen.
Well, first we initialize all the values.
So here, this probability that we're
interested in is initialized to a uniform
distribution over all the words.
And then the E-step would give us a guess
of the distribution that has been used
to generate each word.
We can see we have different
probabilities for different words.
Why?
Well, that's because these words have
different probabilities in the background.
So even though the two
distributions are equally likely,
and our initialization is a uniform
distribution, because of the difference
in the background word distribution,
we have different guessed probabilities.
So these words are believed to
be more likely from the topic.
These on the other hand are less likely.
Probably from background.
So once we have these z values,
we know in the M-step these probabilities
will be used to adjust the counts.
So four must be multiplied by this 0.33
in order to get the allocated
counts toward the topic.
And this is done by this multiplication.
Note that if our guess says this
is 100%, that is, if this is 1.0,
then we just get the full count
of this word for this topic.
In general it's not going
to be 1.0.
So we're just going to get some percentage
of this count toward this topic.
Then we simply normalize these counts
to have a new generation
of parameter estimates.
So you can see, compare this with
the older one, which is here.
So compare this with this one and
we'll see the probability is different.
Not only that, we also see some
words that are believed to have come from
the topic will have a higher probability.
Like this one, text.
And of course, this new generation of
parameters would allow us to further
adjust the inferred latent variable or
hidden variable values.
So we have a new generation of values,
because of the E-step based on
the new generation of parameters.
And these new inferred values
of Zs will give us then
another generation of the estimate
of probabilities of the word.
And so on and so forth so this is what
would actually happen when we compute
these probabilities
using the EM Algorithm.
As you can see in the last row
where we show the log-likelihood,
and the likelihood is increasing
as we do the iteration.
And note that this log-likelihood is
negative because the probability is
between 0 and 1, and when you take a logarithm,
it becomes a negative value.
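The whole computation can be sketched in Python as below. The word counts and background probabilities are assumptions for illustration, since the transcript does not list the slide's numbers; they were chosen so that the first E-step gives 0.33 for a word with count four, matching the value mentioned above. The function name `em_two_component` is also hypothetical.

```python
import math

def em_two_component(counts, p_bg, prior_topic=0.5, iterations=10):
    """EM for p(w|theta_d) with a known background model and fixed mixing weights.

    Returns the final topic word distribution, the p(z=0|w) values from the
    last E-step, and the log-likelihood recorded before each iteration.
    """
    words = list(counts)
    p_topic = {w: 1.0 / len(words) for w in words}  # uniform initialization
    history = []
    for _ in range(iterations):
        # Log-likelihood of the data under the current parameters.
        loglik = sum(
            counts[w] * math.log(prior_topic * p_topic[w] + (1 - prior_topic) * p_bg[w])
            for w in words
        )
        history.append(loglik)
        # E-step: p(z=0 | w) by Bayes' rule.
        z = {
            w: prior_topic * p_topic[w]
            / (prior_topic * p_topic[w] + (1 - prior_topic) * p_bg[w])
            for w in words
        }
        # M-step: allocate soft counts to the topic, then normalize.
        soft = {w: counts[w] * z[w] for w in words}
        total = sum(soft.values())
        p_topic = {w: soft[w] / total for w in words}
    return p_topic, z, history

# Assumed toy data: four words, their counts, and a known background model.
counts = {"the": 4, "paper": 2, "text": 4, "mining": 2}
p_bg = {"the": 0.5, "paper": 0.3, "text": 0.1, "mining": 0.1}
p_topic, z, history = em_two_component(counts, p_bg)
```

The recorded log-likelihood values are all negative and never decrease from one iteration to the next, and the topic distribution ends up giving "text" a higher probability than "the", even though both words have the same count.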
Now what's also interesting is,
you'll note the last column.
And these are the inferred word splits.
And these are the probabilities
that a word is believed to
have come from one distribution, in this
case the topical distribution, all right.
And you might wonder whether
this would be also useful.
Because our main goal is to
estimate these word distributions.
So this is our primary goal.
We hope to have a more discriminative
word distribution.
But the last column is also a byproduct.
This also can actually be very useful.
You can think about that.
For example, we might want to use it
to estimate to what extent this
document has covered background words.
And when we add these up or
take the average, we will kind of know to
what extent it has covered background
versus content words that are not
explained well by the background.
[MUSIC]

So, I just showed you that empirically
the likelihood will converge,
but theoretically it can also
be proved that EM algorithm will
converge to a local maximum.
So here's just an illustration of what
happens. A detailed explanation
would require more knowledge about
some inequalities
that we haven't really covered yet.
So here what you see is, on the x
dimension, we have a theta value.
This is a parameter that we have.
On the y axis we see
the likelihood function.
So this curve is the original
likelihood function,
and this is the one that
we hope to maximize.
And we hope to find a theta value
at this point to maximize this.
But in the case of a mixture model, we
cannot easily find an analytic solution
to the problem.
So, we have to resort to
numerical algorithms, and
the EM algorithm is such an algorithm.
It's a hill-climbing algorithm.
That would mean you start
with some random guess.
Let's say you start from here,
that's your starting point.
And then you try to improve
this by moving this to
another point where you can
have a higher likelihood.
So that's the idea of hill climbing.
And in the EM algorithm, the way we
achieve this is to do two things.
First, we'll fit a lower
bound of the likelihood function.
So this is the lower bound.
See here.
And once we fit the lower bound,
we can then maximize the lower bound.
And of course, the reason why this works,
is because the lower bound
is much easier to optimize.
So we know our current guess is here.
And by maximizing the lower bound,
we'll move this point to the top.
To here.
Right?
And we can then map to the original
likelihood function, we find this point.
Because it's a lower bound, we are
guaranteed to improve this guess, right?
Because we improve our lower bound and
then the original likelihood
curve which is above this lower bound
will definitely be improved as well.
So we already know it's
improving the lower bound.
So we definitely improve this
original likelihood function,
which is above this lower bound.
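For reference, the lower-bound argument can be written out as follows. This is the standard bound based on Jensen's inequality, stated here as a sketch; it is not taken from the slide, and the notation (X for the observed data, z for the hidden variables, q for any distribution over z) is introduced only for this illustration.

```latex
% For any distribution q(z) over the hidden variables:
\log p(X \mid \theta)
  = \log \sum_{z} p(X, z \mid \theta)
  \;\ge\; \sum_{z} q(z) \log \frac{p(X, z \mid \theta)}{q(z)}
  \;=:\; \mathcal{L}(q, \theta).

% E-step: choose q(z) = p(z \mid X, \theta^{(n)}), which makes the bound
% tight at the current guess:
\mathcal{L}(q, \theta^{(n)}) = \log p(X \mid \theta^{(n)}).

% M-step: \theta^{(n+1)} = \arg\max_{\theta} \mathcal{L}(q, \theta), so
\log p(X \mid \theta^{(n+1)})
  \;\ge\; \mathcal{L}(q, \theta^{(n+1)})
  \;\ge\; \mathcal{L}(q, \theta^{(n)})
  \;=\; \log p(X \mid \theta^{(n)}).
```

The last chain of inequalities is the formal version of the picture: improving the lower bound can never decrease the original likelihood.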
So, in our example,
the current guess is parameter value
given by the current generation.
And then the next guess is
the re-estimated parameter values.
From this illustration you
can see the next guess
is always better than the current guess.
Unless it has reached the maximum,
where it will be stuck there.
So the two would be equal.
So, the E-step is basically
to compute this lower bound.
We don't directly just compute
this likelihood function, but
we compute the latent
variable values, and
these are basically a part
of this lower bound.
This helps determine the lower bound.
The M-step on the other hand is
to maximize the lower bound.
It allows us to move
parameters to a new point.
And that's why EM algorithm is guaranteed
to converge to a local maximum.
Now, as you can imagine,
when we have many local maxima,
we also have to repeat the EM
algorithm multiple times.
In order to figure out which one
is the actual global maximum.
And this actually in general is a
difficult problem in numerical optimization.
So here for
example had we started from here,
then we gradually just
climb up to this top.
So, that's not optimal, and
we'd like to climb up all the way to here,
so the only way to climb up to here
is to start from somewhere here or here.
So, in the EM algorithm, we generally
would have to start from different points
or have some other way to determine
a good initial starting point.
To summarize in this lecture we
introduced the EM algorithm.
This is a general algorithm for computing
the maximum likelihood estimate of all
kinds of models, so
not just for our simple model.
And it's a hill-climbing algorithm, so it
can only converge to a local maximum and
it will depend on initial points.
The general idea is that we will have
two steps to improve the estimate of parameters.
In the E-step we roughly augment
the data by predicting values
of useful hidden variables that we
would use to simplify the estimation.
In our case, this is the distribution
that has been used to generate the word.
In the M-step then we would exploit
such augmented data which would make
it easier to estimate the distribution,
to improve the estimate of parameters.
Here improvement is guaranteed in
terms of the likelihood function.
Note that it's not necessary that we
will have a stable convergence of
parameter values even though the likelihood
function is ensured to increase.
There are some properties that have to
be satisfied in order for the parameters
also to converge to some stable value.
Now here data augmentation
is done probabilistically.
That means,
we're not going to just say exactly
what's the value of a hidden variable.
But we're going to have a probability
distribution over the possible values of
these hidden variables.
So this causes a split of counts
of events probabilistically.
And in our case we'll split the word
counts between the two distributions.
[MUSIC]

[SOUND]
This
lecture is about probabilistic
latent semantic analysis, or PLSA.
In this lecture we're going to introduce
probabilistic latent semantic analysis,
often called PLSA.
This is the most basic topic model,
also one of the most useful topic models.
Now this kind of model
can in general be used to
mine multiple topics from text documents.
And PLSA is one of the most basic
topic models for doing this.
So let's first examine this problem
in a little more detail.
Here I show a sample article which is
a blog article about Hurricane Katrina.
And I show some sample topics,
for example, government response,
flooding of the city of New Orleans,
donation, and the background.
You can see in the article we use
words from all these distributions.
So we first for example see there's
a criticism of government response and
this is followed by discussion of flooding
of the city and donation et cetera.
We also see background
words mixed with them.
So the overall goal of topic analysis here
is to try to decode these topics behind
the text, to segment the topics,
to figure out which words are from which
distribution and to figure out first,
what are these topics?
How do we know there's a topic
about government response,
or a topic about flooding in the city?
So these are the tasks
of the topic model.
If we have discovered these
topics, we can color these words,
as you see here,
to separate the different topics.
Then you can do a lot of things,
such as summarization, or segmentation,
of the topics,
clustering of the sentences etc.
So the formal definition of the problem of
mining multiple topics from text is
shown here.
And this is actually a slide that you
have seen in an earlier lecture.
So the input is a collection, the number
of topics, and a vocabulary set, and
of course the text data.
And then the output is of two kinds.
One is the topic characterization,
the theta i's.
Each theta i is a word distribution.
And second, it's the topic coverage for
each document.
These are the pi sub ij's.
And they tell us which document covers
which topic, and to what extent.
So we hope to generate these as output.
Because there are many useful
applications if we can do that.
So the idea of PLSA is
actually very similar to
the two component mixture model
that we have already introduced.
The only difference is that we
are going to have more than two topics.
Otherwise, it is essentially the same.
So here I illustrate how we can generate
text that has multiple topics, and
naturally, as in all cases
of probabilistic modeling, we would want
to figure out the likelihood function.
So we would also ask the question,
what's the probability of observing
a word from such a mixture model?
Now if you look at this picture and
compare this with the picture
that we have seen earlier,
you will see the only difference is
that we have added more topics here.
So, before, we had just one topic
besides the background topic.
But now we have more topics.
Specifically, we have k topics now.
All these are topics that we assume
exist in the text data.
So the consequence is that our switch for
choosing a topic is now a multiway switch.
Before, it was just a two-way switch.
We can think of it as flipping a coin.
But now we have multiple ways.
First we can flip a coin to decide
whether we talk about the background.
So it's the background lambda
sub B versus non-background.
1 minus lambda sub B gives
us the probability of
actually choosing a non-background topic.
After we have made this decision,
we have to make another decision to
choose one of these K distributions.
So there is a k-way switch here.
And this is characterized by the pi's,
and these sum to one.
This is just a difference in design,
which is a little bit more complicated.
But once we decide which distribution to
use the rest is the same we are going to
just generate a word by using one of
these distributions as shown here.
So now lets look at the question
about the likelihood.
So what's the probability of observing
a word from such a distribution?
What do you think?
Now we've seen this
problem many times now and
if you can recall, it's generally a sum.
Of all the different possibilities
of generating a word.
So let's first look at how the word can
be generated from the background model.
Well, the probability that the word is
generated from the background model
is lambda sub B multiplied by the probability
of the word from the background model,
right?
Two things must happen.
First, we have to have
chosen the background model,
and that's the probability
lambda sub B.
Then second, we must have actually
obtained the word w from the background,
and that's probability
of w given theta sub b.
Okay, so similarly,
we can figure out the probability of
observing the word from another topic.
Like the topic theta sub k.
Now notice that here's
the product of three terms.
And that's because of the choice
of topic theta sub k,
only happens if two things happen.
One is we decide not to
talk about background.
So, that's a probability
of 1 minus lambda sub B.
Second, we also have to actually choose
theta sub K among these K topics.
So that's the probability of theta sub k,
or pi sub k.
And similarly, the probabilities of
generating a word from the second
topic and the first topic
are like what you are seeing here.
And so
in the end the probability of observing
the word is just a sum of all these cases.
And I have to stress again this is a very
important formula to know because this is
really key to understanding all the topic
models and indeed a lot of mixture models.
So make sure that you really
understand the probability
of w is indeed the sum of these terms.
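
The sum just described can be sketched in a few lines of code. This is only
an illustration: the function name word_probability, the toy vocabulary,
and all numeric values are assumptions, not values from the slide.

```python
def word_probability(word, lambda_b, p_background, pi_d, topics):
    """p(w) = lambda_B * p(w|theta_B) + (1 - lambda_B) * sum_j pi_j * p(w|theta_j)."""
    # Case 1: the background model was chosen and then generated the word.
    background_part = lambda_b * p_background.get(word, 0.0)
    # Cases 2..k+1: topic j was chosen (pi_j) and then generated the word.
    topic_part = sum(pi_j * theta_j.get(word, 0.0)
                     for pi_j, theta_j in zip(pi_d, topics))
    return background_part + (1.0 - lambda_b) * topic_part


# Hypothetical toy model with a three-word vocabulary and k = 2 topics.
p_background = {"the": 0.8, "flood": 0.1, "donate": 0.1}
topics = [
    {"the": 0.1, "flood": 0.8, "donate": 0.1},
    {"the": 0.1, "flood": 0.1, "donate": 0.8},
]
pi_d = [0.7, 0.3]
p_flood = word_probability("flood", 0.5, p_background, pi_d, topics)
total = sum(word_probability(w, 0.5, p_background, pi_d, topics)
            for w in p_background)
print(p_flood, total)
```

Because every component is a distribution and the mixing weights sum to one,
the probabilities of all words in the vocabulary again sum to one.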
So, next,
once we have the likelihood function,
we would be interested in
knowing the parameters.
All right, so to estimate the parameters.
But firstly,
let's put all these together to have the
complete likelihood function for PLSA.
The first line shows the probability of a
word as illustrated on the previous slide.
And this is an important
formula as I said.
So let's take a closer look at this.
This actually contains all
the important parameters.
So first of all we see lambda sub b here.
This represents a percentage
of background words
that we believe exist in the text data.
And this can be a known value
that we set empirically.
Second, we see the background
language model, and
typically we also assume this is known.
We can use a large collection of text, or
use all the text that we have available,
to estimate the word distribution.
Next, in the rest of this formula,
you see two interesting
kinds of parameters.
Those are the most important parameters
that we are interested in.
One is the pi's.
And these are the coverage
of a topic in the document.
And the other is word distributions
that characterize all the topics.
So the next line
then is simply to plug this
in to calculate
the probability of a document.
This is, again, of the familiar
form where you have a sum and
you have a count of
a word in the document,
and then the log of a probability.
Now it's a little bit more
complicated than the two-component case.
Because now we have more components,
so the sum involves more terms.
And then this line is just
the likelihood for the whole collection.
And it's very similar, just accounting for
more documents in the collection.
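
Since the slide itself is not reproduced in this transcript, the three lines
described above can be written out as follows. This is a reconstruction from
the spoken description; the notation c(w,d) for the count of word w in
document d and \pi_{d,j} for the coverage of topic j in document d is an
assumption about the slide's symbols.

```latex
\begin{align*}
p_d(w) &= \lambda_B\, p(w \mid \theta_B)
          + (1-\lambda_B) \sum_{j=1}^{k} \pi_{d,j}\, p(w \mid \theta_j) \\
\log p(d) &= \sum_{w \in V} c(w,d)\, \log p_d(w) \\
\log p(C) &= \sum_{d \in C} \sum_{w \in V} c(w,d)\, \log p_d(w)
\end{align*}
```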
So what are the unknown parameters?
I already said that there are two kinds.
One is coverage,
one is word distributions.
Again, it's a useful exercise for
you to think about.
Exactly how many
parameters there are here.
How many unknown parameters are there?
Now, trying to
think through that question will help you
understand the model in more detail.
And it will also allow you to understand
what the output would be that we generate
when we use PLSA to analyze text data.
And these are precisely
the unknown parameters.
So after we have obtained
the likelihood function shown here,
the next is to worry about
the parameter estimation.
And we can do the usual thing,
the maximum likelihood estimator.
So again, it's a constrained optimization
problem, like what we have seen before.
Only that we have a collection of text and
we have more parameters to estimate.
And we still have two constraints,
two kinds of constraints.
One is the word distributions.
All the words must have probabilities
that sum to one for one distribution.
The other is the topic
coverage distribution:
a document will have to cover
precisely these k topics, so
the probabilities of covering each
topic would have to sum to 1.
So at this point, it's basically
a well-defined applied math problem.
You just need to figure out
the solutions to the optimization problem.
There's a function with many variables,
and we need to just figure
out the values of these
variables to make the function
reach its maximum.
[MUSIC]

[SOUND]
We can compute this maximum likelihood estimate
by using the EM algorithm.
So in the E-step,
we now have to introduce more hidden
variables because we have more topics,
so our hidden variable z now,
which is a topic indicator, can
take more than two values.
So specifically it will
take k plus one values,
with B denoting the background
and one through k
denoting the other k topics, right.
So now the E-step, as you can
recall, is to augment the data
by predicting the values
of the hidden variable.
So we're going to predict for
a word, whether the word has come from
one of these k plus one distributions.
This equation allows us to
predict the probability
that the word w in document d is
generated from topic theta sub j.
And the bottom one is
the predicted probability that this
word has been generated
from the background.
Note that we use document
d here to index the word.
Why?
Because whether a word is
from a particular topic
actually depends on the document.
Can you see why?
Well, it's through the pi's.
The pi's are tied to each document.
Each document can have potentially
different pi's, right.
The pi's will then affect our prediction.
So, the pi's are here.
And this depends on the document.
And that might give a different guess for
a word in different documents,
and that's desirable.
In both cases we are using
Bayes' rule, as I explained, basically
assessing the likelihood of generating
the word from each of these distributions
and then normalizing.
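
As a rough sketch of the two E-step predictions, assuming the two-stage
switch described earlier (background versus non-background first, then a
k-way choice among topics), the code could look like this. The function name
e_step and the toy numbers are assumptions; the exact form on the slide is
not reproduced in this transcript.

```python
def e_step(word, lambda_b, p_background, pi_d, topics):
    """Return ([p(z=j | w, d) for each topic j], p(z=B | w, d))."""
    # Unnormalized weight of each topic: coverage times word probability.
    weights = [pi_j * theta_j[word] for pi_j, theta_j in zip(pi_d, topics)]
    topic_mass = sum(weights)  # assumed to be positive
    # Among the k topics only: which topic generated the word?
    p_topic = [wt / topic_mass for wt in weights]
    # Background versus the whole topic mixture.
    bg = lambda_b * p_background[word]
    p_bg = bg / (bg + (1.0 - lambda_b) * topic_mass)
    return p_topic, p_bg


# Hypothetical toy model; only the coverage pi differs between the documents.
p_background = {"the": 0.8, "flood": 0.1, "donate": 0.1}
topics = [
    {"the": 0.1, "flood": 0.8, "donate": 0.1},
    {"the": 0.1, "flood": 0.1, "donate": 0.8},
]
p_topic, p_bg = e_step("flood", 0.5, p_background, [0.7, 0.3], topics)
p_topic_other, _ = e_step("flood", 0.5, p_background, [0.1, 0.9], topics)
print(p_topic, p_bg)
print(p_topic_other)
```

The same word gets a different guess in the two documents purely because
their pi's differ, which is the document dependence noted above.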
What about the M-step?
Well, you may recall that in the M-step we
take advantage of the inferred z values
to split the counts,
and then collect the right counts
to re-estimate the parameters.
So in this case, we can re-estimate
our coverage probability.
And this is re-estimated based on
collecting all the words in the document.
And that's why we have the count
of the word in document.
And sum over all the words.
And then we're going to look at to
what extent this word belongs to
the topic theta sub j.
And this part is our guess from the E-step.
This tells us how likely this word
is actually from theta sub j.
And when we multiply them together,
we get the discounted count that's
allocated to topic theta sub j.
And when we normalize
this over all the topics,
we get the distribution of all
the topics to indicate the coverage.
And similarly, the bottom one is the
estimated probability of word for a topic.
And in this case we are using exactly
the same count. You can see this is
the same discounted count;
it tells us to what extent we should
allocate this word to theta sub j. But
then the normalization is different.
Because in this case we are interested
in the word distribution, so
we simply normalize this
over all the words.
This is different; in contrast, here we
normalize among all the topics.
It would be useful to make
a comparison between the two.
These give us different distributions.
And they tell us how to
improve the parameters.
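
The two re-estimation formulas discussed here can be written out as below.
This is a reconstruction from the spoken description, not a copy of the
slide: the factor (1 - p(z_{d,w}=B)), which discounts the part of the count
attributed to the background, is an assumption that follows from the
two-stage switch in the model.

```latex
\begin{align*}
\pi_{d,j}^{(n+1)} &=
  \frac{\sum_{w \in V} c(w,d)\,\bigl(1 - p(z_{d,w}=B)\bigr)\, p(z_{d,w}=j)}
       {\sum_{j'=1}^{k} \sum_{w \in V} c(w,d)\,\bigl(1 - p(z_{d,w}=B)\bigr)\, p(z_{d,w}=j')} \\[6pt]
p^{(n+1)}(w \mid \theta_j) &=
  \frac{\sum_{d \in C} c(w,d)\,\bigl(1 - p(z_{d,w}=B)\bigr)\, p(z_{d,w}=j)}
       {\sum_{w' \in V} \sum_{d \in C} c(w',d)\,\bigl(1 - p(z_{d,w'}=B)\bigr)\, p(z_{d,w'}=j)}
\end{align*}
```

The numerators share the same allocated count; only the normalizer changes,
over topics in the first formula and over words in the second.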
And as I just explained,
in both formulas we have a maximum
likelihood estimate based on the allocated
word counts [INAUDIBLE].
Now this phenomenon is actually a general
phenomenon in all EM algorithms.
In the M-step, you generally
compute the expected count of
an event based on the E-step result,
and then you just
take that count for a particular
parameter and normalize it, typically.
So, in terms of computation
of this EM algorithm, we can
actually just keep counting various
events and then normalize them.
And when we think this way,
we also have a more concise way
of presenting the EM Algorithm.
It actually helps us better
understand the formulas.
So I'm going to go over
this in some detail.
So as an algorithm, we first initialize
all the unknown parameters randomly,
all right.
So, in our case, we are interested in all
of those coverage parameters, the pi's, and
the word distributions [INAUDIBLE],
and we just randomly initialize and normalize them.
This is the initialization step and then
we will repeat until likelihood converges.
Now how do we know whether
likelihood converges?
We can compute the
likelihood at each step and
compare the current likelihood
with the previous likelihood.
If it doesn't change much,
we're going to say it has converged and stop, right.
So, in each step we're
going to do an E-step and an M-step.
In the E-step we're going to
augment the data by predicting
the hidden variables.
In this case,
the hidden variable, z sub d, w,
indicates whether the word w in
d is from a topic or background.
And if it's from a topic, which topic.
So if you look at the e-step formulas,
essentially we're actually
normalizing these counts, sorry,
these probabilities of observing
the word from each distribution.
So you can see,
basically the prediction of word
from topic theta sub j is
based on the probability of
selecting theta sub j as the word
distribution to generate the word,
multiplied by the probability of observing
the word from that distribution.
And I said it's proportional to this
because in the implementation of the
EM algorithm you can keep a counter for
this quantity, and
in the end just normalize it.
So the normalization here
is over all the topics and
then you would get a probability.
Now, in the M-step, we do the same,
and we are going to collect these
allocated counts for each topic.
And we split words among the topics.
And then we're going to normalize
them in different ways to obtain
the re-estimate.
So for example, we can normalize among all
the topics to get the re-estimate of pi,
the coverage.
Or we can re-normalize
based on all the words.
And that would give us
a word distribution.
So it's useful to think of the algorithm in this
way because when you implement it, you can just
use variables to keep track of
these quantities in each case.
And then you just normalize these
variables to make them distributions.
Now I did not put the constraint for
this one.
And I intentionally leave
this as an exercise for you.
And you can see,
what's the normalizer for this one?
It's of a slightly different form but
it's essentially the same as
the one that you have
seen here in this one.
So in general, in the implementation of EM
algorithms you will see you accumulate
the counts, various counts and
then you normalize them.
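
The "accumulate counts, then normalize" view can be sketched as a small
program. This is a minimal illustration under stated assumptions, not the
lecture's own implementation: the function name plsa_em, the fixed
lambda_b of 0.5, the stopping threshold, and the toy documents are all
hypothetical, and the background model is simply estimated from the toy
collection itself.

```python
import math
import random


def plsa_em(docs, k, p_background, lambda_b=0.5, iters=50, seed=0):
    """docs: list of {word: count}. Returns (pi, topics, log-likelihood history)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})

    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    # Random initialization, normalized so that each set is a distribution.
    pi = [normalize([rng.random() + 0.1 for _ in range(k)]) for _ in docs]
    topics = [dict(zip(vocab, normalize([rng.random() + 0.1 for _ in vocab])))
              for _ in range(k)]

    history = []
    for _ in range(iters):
        pi_counts = [[0.0] * k for _ in docs]
        word_counts = [{w: 0.0 for w in vocab} for _ in range(k)]
        log_likelihood = 0.0
        for d, doc in enumerate(docs):
            for w, c in doc.items():
                # E-step: weight of each distribution for this word.
                weights = [pi[d][j] * topics[j][w] for j in range(k)]
                topic_mass = sum(weights)
                bg = lambda_b * p_background[w]
                p_w = bg + (1.0 - lambda_b) * topic_mass
                log_likelihood += c * math.log(p_w)
                not_bg = 1.0 - bg / p_w  # share not taken by the background
                for j in range(k):
                    # Count of w in d allocated to topic j.
                    allocated = c * not_bg * weights[j] / topic_mass
                    pi_counts[d][j] += allocated
                    word_counts[j][w] += allocated
        # M-step: normalize the same counts in two different ways.
        pi = [normalize(row) for row in pi_counts]
        for j in range(k):
            total = sum(word_counts[j].values())
            topics[j] = {w: word_counts[j][w] / total for w in vocab}
        history.append(log_likelihood)
        if len(history) > 1 and history[-1] - history[-2] < 1e-6:
            break  # the likelihood has stopped changing much
    return pi, topics, history


# Hypothetical toy collection.
docs = [{"flood": 4, "city": 3, "the": 5},
        {"donate": 4, "aid": 3, "the": 5},
        {"flood": 2, "donate": 2, "the": 4}]
totals = {}
for doc in docs:
    for w, c in doc.items():
        totals[w] = totals.get(w, 0) + c
n_words = sum(totals.values())
p_background = {w: c / n_words for w, c in totals.items()}
pi, topics, history = plsa_em(docs, k=2, p_background=p_background)
print([[round(p, 2) for p in row] for row in pi])
```

Because this is hill climbing from a random start, a different seed may give
a different local maximum, so in practice one would try several seeds.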
So to summarize,
we introduced the PLSA model.
Which is a mixture model with k unigram
language models representing k topics.
And we also added a pre-determined
background language model to
help discover discriminative topics,
because this background language model
can help attract the common terms.
And we use the maximum likelihood estimate
so that we can discover topical
knowledge from text data.
In this case PLSA allows us to discover
two things: one is k word distributions,
each one representing a topic and
the other is the proportion of
each topic in each document.
And such detailed characterization
of coverage of topics in documents
can enable a lot of further analysis.
For example, we can aggregate
the documents in a particular
time period to assess the coverage of
a particular topic in that time period.
That would allow us to generate
the temporal trends of topics.
We can also aggregate topics covered in
documents associated with a particular
author and then we can characterize
the topics written by this author, etc.
And in addition to this, we can also
cluster terms and cluster documents.
In fact,
each topic can be regarded as a cluster.
So we already have the term clusters.
The high-probability
words can be regarded as
belonging to one cluster
represented by the topic.
Similarly, documents can be
clustered in the same way.
We can assign a document
to the topic cluster
that's covered most in the document.
So remember, pi's indicate to what extent
each topic is covered in the document,
we can assign the document to the topical
cluster that has the highest pi.
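
Assigning each document to its most covered topic is a one-line operation
once the pi's are available. The sketch below uses a hypothetical function
name assign_clusters and made-up coverage values.

```python
def assign_clusters(pi):
    """Assign each document to the index of the topic with the highest pi."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in pi]


# Hypothetical coverage values: one row per document, one column per topic.
pi = [[0.7, 0.2, 0.1],
      [0.1, 0.3, 0.6],
      [0.25, 0.5, 0.25]]
clusters = assign_clusters(pi)
print(clusters)
```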
And in general there are many useful
applications of this technique.
[MUSIC]

[MUSIC]
This lecture is about topic mining and
analysis.
We're going to talk about
using a term as topic.
This is a slide that you have
seen in an earlier lecture
where we define the task of
topic mining and analysis.
We also raised the question, how do
we exactly define the topic theta?
So in this lecture, we're going to
offer one way to define it, and
that's our initial idea.
Our idea here is defining
a topic simply as a term.
A term can be a word or a phrase.
And in general,
we can use these terms to describe topics.
So our first thought is just
to define a topic as one term.
For example, we might have terms
like sports, travel, or science,
as you see here.
Now if we define a topic in this way,
we can then analyze the coverage
of such topics in each document.
Here for example,
we might want to discover to what
extent document one covers sports.
And we found that 30% of the content
of document one is about sports.
And 12% is about travel, etc.
We might also discover document
two does not cover sports at all.
So the coverage is zero, etc.
So now, of course,
as we discussed in the task definition for
topic mining and analysis,
we have two tasks.
One is to discover the topics.
And the second is to analyze coverage.
So let's first think
about how we can discover
topics if we represent
each topic by a term.
So that means we need to mine k
topical terms from a collection.
Now there are, of course,
many different ways of doing that.
And we're going to talk about
a natural way of doing that,
which is also likely effective.
So first of all,
we're going to parse the text data in
the collection to obtain candidate terms.
Here candidate terms can be words or
phrases.
Let's say the simplest solution is
to just take each word as a term.
These words then become candidate topics.
Then we're going to design a scoring
function to measure how good each term
is as a topic.
So how can we design such a function?
Well there are many things
that we can consider.
For example, we can use pure statistics
to design such a scoring function.
Intuitively, we would like to
favor representative terms,
meaning terms that can represent
a lot of content in the collection.
So that would mean we want
to favor a frequent term.
However, if we simply use the frequency
to design the scoring function,
then the highest scored terms
would be general terms or
function words like "the", etc.
Those terms occur very frequently in English.
So we also want to avoid having
such words on the top so
we want to penalize such words.
But in general, we would like to favor
terms that are fairly frequent but
not so frequent.
So a particular approach could be based
on TF-IDF weighting from retrieval.
And TF stands for term frequency.
IDF stands for inverse document frequency.
We talked about some of these
ideas in the lectures about
the discovery of word associations.
So these are statistical methods,
meaning that the function is
defined mostly based on statistics.
So the scoring function
would be very general.
It can be applied to any language,
any text.
But when we apply such an approach
to a particular problem,
we might also be able to leverage
some domain-specific heuristics.
For example, in news we might favor
title words. Actually, in general,
we might want to favor title
words because the authors tend to
use the title to describe
the topic of an article.
If we're dealing with tweets,
we could also favor hashtags,
which are invented to denote topics.
So naturally, hashtags can be good
candidates for representing topics.
Anyway, after we have designed this
scoring function, then we can discover
the k topical terms by simply picking
k terms with the highest scores.
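
As one possible instance of such a scoring function, the sketch below
multiplies a term's collection frequency by log(N / df). That particular
weighting is an assumption (the lecture does not fix a formula), as are the
function names score_terms and top_k_terms and the toy documents.

```python
import math
from collections import Counter


def score_terms(docs):
    """Score each word by collection frequency times inverse document frequency."""
    tf = Counter()  # total occurrences of the word in the collection
    df = Counter()  # number of documents that contain the word
    for doc in docs:
        words = doc.lower().split()
        tf.update(words)
        df.update(set(words))
    n = len(docs)
    # A word that occurs in every document gets idf = log(1) = 0.
    return {w: tf[w] * math.log(n / df[w]) for w in tf}


def top_k_terms(docs, k):
    """Pick the k highest-scoring terms, breaking ties alphabetically."""
    scores = score_terms(docs)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Hypothetical toy collection.
docs = ["the team won the game",
        "the flight and the hotel",
        "the game was the best game"]
scores = score_terms(docs)
best = top_k_terms(docs, 1)
print(best, scores["the"])
```

Here "the" is the most frequent word but scores zero because it occurs in
every document, while "game" is fairly frequent without being everywhere.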
Now, of course,
we might encounter a situation where the
highest scored terms are all very similar.
They're semantically similar, or
closely related, or even synonyms.
So that's not desirable.
So we also want to have coverage over
all the content in the collection.
So we would like to remove redundancy.
And one way to do that is
to do a greedy algorithm,
which is sometimes called a maximal
marginal relevance ranking.
Basically, the idea is to go down
the list based on our scoring
function and gradually take terms
to collect the k topical terms.
The first term, of course, will be picked.
When we pick the next term, we're
going to look at what terms have already
been picked and try to avoid
picking a term that's too similar.
So while we are considering
the ranking of a term in the list,
we are also considering
the redundancy of the candidate term
with respect to the terms
that we already picked.
And with some thresholding,
then we can get a balance of
the redundancy removal and
also high score of a term.
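
The greedy selection with thresholding can be sketched as follows. This is
a simplified variant that skips a candidate whenever it is too similar to a
term already picked; it is not the full maximal marginal relevance formula.
The function names, the threshold of 0.5, and the hand-made similarity
table are all hypothetical.

```python
def pick_diverse_terms(scores, similarity, k, threshold=0.5):
    """Greedily pick up to k high-scoring terms, skipping near-duplicates."""
    picked = []
    # Go down the list from the highest score to the lowest.
    for term in sorted(scores, key=lambda t: (-scores[t], t)):
        if len(picked) == k:
            break
        # Keep the candidate only if it is not too similar to any picked term.
        if all(similarity(term, p) < threshold for p in picked):
            picked.append(term)
    return picked


# Hypothetical similarity: 1.0 for listed pairs, 0.0 otherwise.
similar_pairs = {frozenset(("sports", "sport")), frozenset(("travel", "trip"))}


def toy_similarity(a, b):
    return 1.0 if frozenset((a, b)) in similar_pairs else 0.0


scores = {"sports": 0.9, "sport": 0.8, "travel": 0.7, "trip": 0.6, "science": 0.5}
picked = pick_diverse_terms(scores, toy_similarity, k=3)
print(picked)
```

In a real system the similarity function might come from co-occurrence
statistics or a thesaurus; the table here only stands in for that.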
Okay, so
after this we will get k topical terms.
And those can be regarded as the topics
that we discovered from the collection.
Next, let's think about how we're going
to compute the topic coverage pi sub ij.
So looking at this picture,
we have sports, travel, and science as
these topics.
And now suppose you are given a document.
How should we figure out the coverage
of each topic in the document?
Well, one approach can be to simply
count occurrences of these terms.
So for example, sports might have occurred
four times in this document and
travel occurred twice, etc.
And then we can just normalize these
counts as our estimate of the coverage
probability for each topic.
So in general, the formula would
be to collect the counts of
all the terms that represent the topics.
And then simply normalize them so
that the coverage of each
topic in the document would add to one.
This forms a distribution of the topics
for the document to characterize coverage
of different topics in the document.
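
The counting approach is short enough to sketch directly. The function name
topic_coverage and the toy document are assumptions; the counts of four and
two match the example just given.

```python
def topic_coverage(document, topic_terms):
    """Estimate coverage by counting each topical term and normalizing."""
    words = document.lower().split()
    counts = [words.count(term) for term in topic_terms]
    total = sum(counts)
    if total == 0:
        # No topical term occurs, so every estimate is zero.
        return [0.0] * len(topic_terms)
    return [c / total for c in counts]


# Hypothetical document: "sports" occurs four times, "travel" twice.
document = "sports sports travel sports news travel sports"
coverage = topic_coverage(document, ["sports", "travel", "science"])
print(coverage)
```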
Now, as always,
when we think about an idea for
solving a problem, we have to ask
the question, how good is this one?
Or is this the best way
of solving the problem?
So now let's examine this approach.
In general,
we have to do some empirical evaluation
by using actual data sets and
to see how well it works.
Well, in this case let's take
a look at a simple example here.
And we have a text document that's
about an NBA basketball game.
So in terms of the content,
it's about sports.
But if we simply count these
words that represent our topics,
we will find that the word sports
actually did not occur in the article,
even though the content
is about the sports.
So the count of sports is zero.
That means the coverage of sports
would be estimated as zero.
Now of course,
the term science also did not occur in
the document and
its estimate is also zero.
And that's okay.
But sports certainly is not okay because
we know the content is about sports.
So this estimate has a problem.
What's worse, the term travel
actually occurred in the document.
So when we estimate the coverage
of the topic travel,
we have got a non-zero count.
So its estimated coverage
will be non-zero.
So this obviously is also not desirable.
So this simple example illustrates
some problems of this approach.
First, when we count what
words belong to the topic,
we also need to consider related words.
We can't simply just count
the topic word sports.
In this case, it did not occur at all.
But there are many related words
like basketball, game, etc.
So we need to count
the related words also.
The second problem is that a word
like star can be actually ambiguous.
So here it probably means
a basketball star, but
we can imagine it might also
mean a star in the sky.
So in that case, the star might actually
suggest, perhaps, a topic of science.
So we need to deal with that as well.
Finally, a main restriction of this
approach is that we have only one
term to describe the topic, so it cannot
really describe complicated topics.
For example, a very specialized
topic in sports would be harder to
describe by using just a word or
one phrase.
We need to use more words.
So this example illustrates
some general problems with
this approach of treating a term as topic.
First, it lacks expressive power.
Meaning that it can only represent
the simple general topics, but
it cannot represent the complicated topics
that might require more words to describe.
Second, it's incomplete
in vocabulary coverage,
meaning that the topic itself
is only represented as one term.
It does not suggest what other
terms are related to the topic.
Even if we're talking about sports,
there are many terms that are related.
So it does not allow us to easily
count related terms in order
to contribute to the coverage of this topic.
Finally, there is this problem
of word sense disambiguation.
A topical term or
related term can be ambiguous.
For example,
basketball star versus star in the sky.
So in the next lecture,
we're going to talk
about how to solve
these problems with using a term as a topic.
[MUSIC]

This lecture is about Probabilistic Topic
Models for topic mining and analysis.
In this lecture,
we're going to continue talking
about the topic mining and analysis.
We're going to introduce
probabilistic topic models.
So this is a slide that
you have seen earlier,
where we discussed the problems
with using a term as a topic.
So, to solve these problems
intuitively we need to use
more words to describe the topic.
And this will address the problem
of lack of expressive power.
When we have more words that we
can use to describe the topic,
then we can describe complicated topics.
To address the second problem we
need to introduce weights on words.
This is what allows you to distinguish
subtle differences in topics, and
to introduce semantically
related words in a fuzzy manner.
Finally, to solve the problem of
word ambiguity, we need to split
an ambiguous word, so
that we can disambiguate its topic.
It turns out that all these can be done
by using a probabilistic topic model.
And that's why we're going to spend a lot
of lectures to talk about this topic.
So the basic idea here is to
improve the representation of
a topic to a word distribution.
So what you see now is
the old representation,
where we represented each topic with just
one word, or one term, or one phrase.
But now we're going to use a word
distribution to describe the topic.
So here you see that for sports,
we're going to use
a word distribution over,
theoretically speaking, all
the words in our vocabulary.
So for example, the high
probability words here are sports,
game, basketball,
football, play, star, etc.
These are sports related terms.
And of course it would also give
a non-zero probability to some other word
like travel, which might be
related to sports in general,
but not so much related to the topic.
In general we can imagine a non-zero
probability for all the words.
And some words that are not related
would have very, very small probabilities.
And these probabilities will sum to one.
So that it forms a distribution
of all the words.
Now intuitively, this distribution
represents a topic in that if we sample
words from the distribution, we tend
to see words that are related to sports.
You can also see, as a very special case,
if the probability mass
is concentrated entirely on
just one word, say sports,
then this basically degenerates
to the simple representation
of a topic as just one word.
But as a distribution,
this topic representation can,
in general,
involve many words to describe a topic and
can model subtle differences
in the semantics of a topic.
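
A topic as a word distribution, and the special case where all the mass
sits on one word, can be sketched like this. The probabilities below are
made up for illustration and cover only a four-word vocabulary; the
function name sample_words is also an assumption.

```python
import random


def sample_words(distribution, n, seed=0):
    """Draw n words from a topic, i.e. a {word: probability} distribution."""
    rng = random.Random(seed)
    words = list(distribution)
    weights = [distribution[w] for w in words]
    return rng.choices(words, weights=weights, k=n)


# Hypothetical "sports" topic; the probabilities sum to one.
sports = {"sports": 0.4, "game": 0.3, "basketball": 0.2, "travel": 0.1}
# Degenerate topic: all probability mass on a single word.
one_word = {"sports": 1.0, "game": 0.0}

sampled = sample_words(sports, 10)
degenerate = sample_words(one_word, 5)
print(sampled)
print(degenerate)
```

Sampling from the degenerate topic can only ever produce "sports", which is
the earlier one-term representation recovered as a special case.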
Similarly we can model Travel and Science
with their respective distributions.
In the distribution for Travel we see top
words like attraction, trip, flight etc.
Whereas in Science we see scientist,
spaceship, telescope, or
genomics, and, you know,
science related terms.
Now that doesn't mean sports related terms
will necessarily have zero
probabilities for science.
In general we can imagine all of these
words will have non-zero probabilities.
It's just that for a particular
topic some words will have very,
very small probabilities.
Now you can also see there are some
words that are shared by these topics.
When I say shared it just means even
with some probability threshold,
you can still see one word
occurring in multiple topics.
In this case I mark them in black.
So you can see travel, for example,
occurred in all the three topics here, but
with different probabilities.
It has the highest probability for
the Travel topic, 0.05.
But with much smaller probabilities for
Sports and Science, which makes sense.
And similarly, you can see a Star
also occurred in Sports and
Science with reasonably
high probabilities.
Because they might be actually
related to the two topics.
So this representation addresses the
three problems that I mentioned earlier.
First, it now uses multiple
words to describe a topic.
So it allows us to describe
fairly complicated topics.
Second, it assigns weights to terms.
So now we can model subtle
differences of semantics.
And you can bring in related
words together to model a topic.
Third, because we have probabilities for
the same word in different topics,
we can disambiguate the sense of a word
in the text to decode
its underlying topic.
So we can address all these three problems with
this new way of representing a topic.
So now of course our problem definition
has been refined just slightly.
The slide is very similar to what
you've seen before, except we have
added a refinement for what our topic is.
Now each topic is a word distribution,
and for each word distribution we know
that all the probabilities should sum to
one over all the words in the vocabulary.
So you see a constraint here.
And we still have another constraint
on the topic coverage, namely the pi's.
So all the Pi sub ij's must sum to one for
the same document.
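
Written out, the two constraints look as follows. The indexing is an
assumption about the slide's notation: i ranges over the k topics in the
first line and over the N documents in the second, and V is the vocabulary.

```latex
\begin{align*}
\sum_{w \in V} p(w \mid \theta_i) &= 1 \quad \text{for each topic } i = 1, \dots, k \\
\sum_{j=1}^{k} \pi_{ij} &= 1 \quad \text{for each document } i = 1, \dots, N
\end{align*}
```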
So how do we solve this problem?
Well, let's look at this problem
as a computation problem.
So we clearly specify its input and
output and
illustrate it here on this slide.
Input of course is our text data.
C is our collection but we also generally
assume we know the number of topics, k.
Or we hypothesize a number and
then try to mine k topics,
even though we don't know the exact
topics that exist in the collection.
And V is the vocabulary that has
a set of words that determines what
units would be treated as
the basic units for analysis.
In most cases we'll use words
as the basis for analysis.
And that means each word is a unit.
Now the output would consist of, first,
a set of topics represented by theta i's.
Each theta i is a word distribution.
And we also want to know the coverage
of topics in each document.
So that's
the same pi sub ij's
that we have seen before.
So given a set of text data we would
like to compute all these distributions and
all these coverages as you
have seen on this slide.
Now of course there may be many
different ways of solving this problem.
In theory, you can write the [INAUDIBLE]
program to solve this problem,
but here we're going to introduce
a general way of solving this
problem called a generative model.
And this is, in fact,
a very general idea and
it's a principled way of using statistical
modeling to solve text mining problems.
And here I dimmed the picture
that you have seen before
in order to show the generation process.
So the idea of this approach is actually
to first design a model for our data.
So we design a probabilistic model
to model how the data are generated.
Of course,
this is based on our assumption.
The actual data aren't
necessarily generated this way.
So that gives us a probability
distribution of the data
that you are seeing on this slide.
Given a particular model and
parameters that are denoted by lambda.
So this lambda actually consists of
all the parameters that
we're interested in.
And these parameters in general
will control the behavior of
the probabilistic model,
meaning that if you set these
parameters to different values,
it will give some data points
higher probabilities than others.
Now in this case of course,
for our text mining problem or
more precisely topic mining problem
we have the following parameters.
First of all we have theta i's, each of which
is a word distribution, and then we have
a set of pis for each document.
And since we have n documents, we have
n sets of pis, and in each set
the pi values will sum to one.
So this is to say that we
first would pretend we already
have these word distributions and
the coverage numbers.
And then we can see how we can generate
data by using such distributions.
So how do we model the data in this way?
And we assume that the data
are actually samples
drawn from such a model that
depends on these parameters.
Now one interesting question here is to
think about how many
parameters are there in total?
Now obviously we can already see
n multiplied by k parameters
for pi's.
We also see k theta i's.
But each theta i is actually a set
of probability values, right?
It's a distribution of words.
So I leave this as an exercise for
you to figure out exactly how
many parameters there are here.
Now once we set up the model then
we can fit the model to our data.
Meaning that we can
estimate the parameters or
infer the parameters based on the data.
In other words we would like to
adjust these parameter values
until we give our data set
the maximum probability.
I just said,
depending on the parameter values,
some data points will have higher
probabilities than others.
What we're interested in, here,
is what parameter values will give
our data set the highest probability?
So I also illustrate the problem
with a picture that you see here.
On the X axis I just illustrate lambda,
the parameters,
as a one dimensional variable.
It's an oversimplification, obviously,
but it suffices to show the idea.
And the Y axis shows the probability
of the observed data.
This probability obviously depends
on the setting of lambda.
So that's why it varies as you
change the value of lambda.
What we're interested in here
is to find the lambda star
that would maximize the probability
of the observed data.
So this would be, then,
our estimate of the parameters.
And these parameters,
note, are precisely what we
hoped to discover from text data.
So we'd treat these parameters
as actually the outcome or
the output of the data mining algorithm.
So this is the general idea of using
a generative model for text mining.
First, we design a model with
some parameter values to fit
the data as well as we can.
After we have fit the data,
we will recover some parameter values.
We will use those specific
parameter values, and
those would be the output
of the algorithm.
And we'll treat those as actually
the discovered knowledge from text data.
By varying the model of course we
can discover different knowledge.
So to summarize, we introduced
a new way of representing a topic,
namely representing it as a word distribution,
and this has the advantage of using
multiple words to describe a complicated
topic. It also allows us to assign
weights on words so we can model
subtle variations of semantics.
We talked about the task of topic mining
and analysis
when we define a topic as a distribution.
So the input is a collection of text
articles and a number of topics and
a vocabulary set, and
the output is a set of topics.
Each is a word distribution and
also the coverage of all
the topics in each document.
And these are formally represented
by theta i's and pi ij's.
And we have two constraints here for
these parameters.
The first is the constraint
on the word distributions.
In each word distribution
the probability of all the words
must sum to 1,
all the words in the vocabulary.
The second constraint is on
the topic coverage in each document.
A document is not allowed to cover
a topic outside of the set of topics that
we are discovering.
So, the coverage of each of these k
topics would sum to one for a document.
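The two constraints just described can be checked mechanically. The following is a minimal sketch with made-up values (two topics, a three-word vocabulary, two documents); the names `theta`, `pi`, and `satisfies_constraints` are illustrative and not part of the lecture.

```python
import math

# Hypothetical toy values: k = 2 topics, a 3-word vocabulary, N = 2 documents.
vocabulary = ["text", "mining", "food"]
theta = [
    {"text": 0.5, "mining": 0.4, "food": 0.1},  # word distribution of topic 1
    {"text": 0.1, "mining": 0.1, "food": 0.8},  # word distribution of topic 2
]
pi = [
    [0.9, 0.1],  # coverage of topics 1..k in document 1
    [0.3, 0.7],  # coverage of topics 1..k in document 2
]

def satisfies_constraints(theta, pi, vocabulary, tol=1e-9):
    """Check that each word distribution and each coverage vector sums to 1."""
    words_ok = all(
        math.isclose(sum(topic[w] for w in vocabulary), 1.0, abs_tol=tol)
        for topic in theta
    )
    coverage_ok = all(math.isclose(sum(row), 1.0, abs_tol=tol) for row in pi)
    return words_ok and coverage_ok

print(satisfies_constraints(theta, pi, vocabulary))
```

Any estimation procedure for this model has to return parameters that pass a check of this kind.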
We also introduce a general idea of using
a generative model for text mining.
And the idea here is, first we design
a model to model the generation of data.
We simply assume that the data
are generated in this way.
And inside the model we embed some
parameters that we're interested in
denoted by lambda.
And then we can infer the most
likely parameter values lambda star,
given a particular data set.
And we can then take the lambda star as
knowledge discovered from the text for
our problem.
And we can adjust
the design of the model and
the parameters to discover various
kinds of knowledge from text.
As you will see later
in the other lectures.
[MUSIC]

[SOUND]
>> This
lecture is about the Overview
of Statistical Language Models,
which cover probabilistic topic
models as special cases.
In this lecture we're going to give
an overview of Statistical Language Models.
These models are general models that cover
probabilistic topic models
as special cases.
So first off,
what is a Statistical Language Model?
A Statistical Language Model is
basically a probability distribution
over word sequences.
So, for example,
we might have a distribution that gives,
today is Wednesday a probability of .001.
It might give today Wednesday is, which
is a non-grammatical sentence, a very,
very small probability as shown here.
And similarly another sentence,
the eigenvalue is positive might
get the probability of .00001.
So as you can see such a distribution
clearly is Context Dependent.
It depends on the Context of Discussion.
Some Word Sequences might have higher
probabilities than others but the same
Sequence of Words might have different
probability in different context.
And so this suggests that such a
distribution can actually characterize a topic.
Such a model can also be regarded
as a probabilistic mechanism for
generating text.
And that just means we can view text
data as data observed from such a model.
For this reason,
we also call such a model a Generative Model.
So, now given a model we can then
sample sequences of words.
So, for example, based on the distribution
that I have shown here on this slide,
we might, let's say, sample
a sequence like today is Wednesday
because it has a relatively
high probability.
We might often get such a sequence.
We might also get the eigenvalue
is positive sometimes,
with a smaller probability, and
very, very occasionally we might
get today Wednesday is, because
its probability is so small.
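The sampling view can be sketched in a few lines of Python. This is only an illustration: the first two probabilities are the ones quoted in the lecture, the value for the non-grammatical sequence "today Wednesday is" is an assumed stand-in for "very, very small", and the function name `sample_sequences` is hypothetical. Only three sequences are listed, so the sampler renormalizes among them.

```python
import random

sequence_probs = {
    "today is Wednesday": 0.001,
    "the eigenvalue is positive": 0.00001,
    "today Wednesday is": 0.0000001,  # assumed value, not given in the lecture
}

def sample_sequences(probs, n, seed=0):
    """Draw n sequences with chance proportional to their model probability."""
    rng = random.Random(seed)
    sequences = list(probs)
    weights = [probs[s] for s in sequences]
    # random.choices treats the weights as relative, so no normalization is needed.
    return rng.choices(sequences, weights=weights, k=n)

samples = sample_sequences(sequence_probs, 10000)
counts = {s: samples.count(s) for s in sequence_probs}
print(counts)
```

With these numbers the first sequence dominates the sample, the second appears occasionally, and the third almost never.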
So in general, in order to characterize such
a distribution we must specify probability
values for
all these different sequences of words.
Obviously, it's impossible
to specify that because it's
impossible to enumerate all of
the possible sequences of words.
So in practice, we will have to
simplify the model in some way.
So, the simplest language model is
called the Unigram Language Model.
In such a case, we simply assume that the text
is generated by generating
each word independently.
But in general, the words may
not be generated independently.
But after we make this assumption, we can
significantly simplify the language model.
Basically, now the probability of
a sequence of words, w1 through wn,
will be just the product of
the probability of each word.
So for such a model,
we have as many parameters as
the number of words in our vocabulary.
So here we assume we have n words,
so we have n probabilities.
One for each word.
And they sum to 1.
So, now we assume that
our text is a sample
drawn according to this word distribution.
That just means,
we're going to draw a word each time and
then eventually we'll get a text.
So for example, now again,
we can try to sample words
according to a distribution.
We might get Wednesday often or
today often.
And some other words like eigenvalue
might have a small probability, etcetera.
But with this, we actually can
also compute the probability of
every sequence, even though our model
only specifies the probabilities of words.
And this is because of the independence.
So specifically, we can compute
the probability of today is Wednesday.
Because it's just a product
of the probability of today,
the probability of is, and
probability of Wednesday.
For example,
I show some fake numbers here and when you
multiply these numbers together you get
the probability of today is Wednesday.
So as you can see, with N probabilities,
one for each word, we actually
can characterize the probability distribution
over all kinds of sequences of words.
And so, this is a very simple model.
It ignores the word order.
So it may not be sufficient for some
problems, such as speech recognition,
where you may care about
the order of words.
But it turns out to be
quite sufficient for
many tasks that involve topic analysis.
And that's also what
we're interested in here.
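As a small sketch of the unigram computation, the probability of a sequence is the product of its word probabilities, so reordering the words does not change the result. The word probabilities below are made-up numbers (a partial listing of a larger vocabulary), and `sequence_probability` is an illustrative name.

```python
import math

# Hypothetical word probabilities; only part of the vocabulary is listed.
unigram = {"today": 0.1, "is": 0.2, "Wednesday": 0.05}

def sequence_probability(words, model):
    """Unigram probability of a sequence: product of per-word probabilities."""
    return math.prod(model.get(w, 0.0) for w in words)

p_forward = sequence_probability("today is Wednesday".split(), unigram)
p_shuffled = sequence_probability("today Wednesday is".split(), unigram)
print(p_forward, p_shuffled)
```

Both calls multiply the same three numbers, which is exactly the sense in which the unigram model ignores word order.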
So when we have a model, we generally have
two problems that we can think about.
One is, given a model, how likely are we
to observe a certain kind of data points?
That is,
we are interested in the Sampling Process.
The other is the Estimation Process.
And that is to estimate
the parameters of a model given
some observed data, and we're
going to talk about that in a moment.
Let's first talk about the sampling.
So, here I show two examples of Word
Distributions or Unigram Language Models.
The first one has higher probabilities for
words like text, mining, association,
etcetera.
Now this signals a topic about text mining
because when we sample words from
such a distribution, we tend to see words
that often occur in text mining context.
So in this case,
if we ask the question about
what is the probability of
generating a particular document.
Then, we likely will see text that
looks like a text mining paper.
Of course, the text that we
generate by drawing words from
this distribution is unlikely to be coherent,
although the probability
of generating a text mining
[INAUDIBLE] published
in a top conference is
non-zero assuming that no word has
a zero probability in the distribution.
And that just means,
we can essentially generate all kinds of
text documents including very
meaningful text documents.
Now, the second distribution, shown
on the bottom, has different
words with high probabilities.
So food [INAUDIBLE] healthy [INAUDIBLE],
etcetera.
So this clearly indicates
a different topic.
In this case it's probably about health.
So if we sample a word
from such a distribution,
then the probability of observing a text
mining paper would be very, very small.
On the other hand, the probability of
observing a text that looks like a food
nutrition paper would be high,
relatively higher.
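The comparison between the two distributions can be sketched by scoring the same text under each of them. The probabilities here are invented for illustration and list only a few words of each distribution; log probabilities are used to avoid very small products.

```python
import math

# Hypothetical partial word distributions (a few entries of a larger vocabulary).
text_mining_topic = {"text": 0.2, "mining": 0.1, "food": 0.0001, "nutrition": 0.0001}
health_topic = {"text": 0.0001, "mining": 0.0001, "food": 0.25, "nutrition": 0.1}

def log_probability(words, model):
    """Log probability of a word sequence under a unigram model."""
    return sum(math.log(model[w]) for w in words)

text_mining_doc = ["text", "mining", "text"]
nutrition_doc = ["food", "nutrition", "food"]

for name, doc in [("text mining doc", text_mining_doc), ("nutrition doc", nutrition_doc)]:
    print(name,
          log_probability(doc, text_mining_topic),
          log_probability(doc, health_topic))
```

Each document scores much higher under the distribution whose high-probability words it contains.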
So that just means, given a particular
distribution, different texts will have
different probabilities.
Now let's look at
the estimation problem now.
In this case, we're going to assume
that we have observed the data.
We know exactly what
the text data looks like.
In this case,
let's assume we have a text mining paper.
In fact, it's the abstract of the paper,
so the total number of words is 100.
And I've shown some counts
of individual words here.
Now, if we ask the question,
what is the most likely
Language Model that has been
used to generate this text data?
Assuming that the text is observed
from some Language Model,
what's our best guess
of this Language Model?
Okay, so the problem now is just to
estimate the probabilities of these words.
As I've shown here.
So what do you think?
What would be your guess?
Would you guess text has
a very small probability, or
a relatively large probability?
What about query?
Well, your guess probably
would be dependent on
how many times we have observed
this word in the text data, right?
And if you think about it for a moment.
And if you are like many others,
you would have guessed that,
well, text has a probability of 10
out of 100 because I've observed
the word text 10 times in the text
that has a total of 100 words.
And similarly, mining has 5 out of 100.
And query has a relatively small
probability, observed just once.
So it's 1 out of 100.
Right, so that, intuitively,
is a reasonable guess.
But the question is, is this our best
guess or best estimate of the parameters?
Of course,
in order to answer this question,
we have to define what we mean by best.
In this case,
it turns out that our
guesses are indeed the best
in some sense, and this is called
the Maximum Likelihood Estimate.
And it's the best in that it will give
the observed data the maximum probability.
Meaning that, if you change
the estimate somehow, even slightly,
then the probability of the observed
text data will be somewhat smaller.
And this is called
a Maximum Likelihood Estimate.
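The counting estimate can be written out directly. The counts for text, mining, and query and the total of 100 words are the ones from the lecture example; the function names are illustrative. The second function is a simplification that is not in the lecture: it collapses the document into "the word text" versus "any other word" so that the effect of changing the estimate slightly can be checked with a single parameter.

```python
import math

counts = {"text": 10, "mining": 5, "query": 1}  # counts quoted in the lecture
total_words = 100

def mle_unigram(counts, total):
    """Maximum likelihood estimate: relative frequency of each word."""
    return {word: count / total for word, count in counts.items()}

def two_outcome_log_likelihood(p, count, total):
    """Log likelihood when each position is either the word (prob p) or not."""
    return count * math.log(p) + (total - count) * math.log(1 - p)

estimate = mle_unigram(counts, total_words)
at_mle = two_outcome_log_likelihood(estimate["text"], 10, total_words)
slightly_higher = two_outcome_log_likelihood(0.11, 10, total_words)
slightly_lower = two_outcome_log_likelihood(0.09, 10, total_words)
print(estimate, at_mle, slightly_higher, slightly_lower)
```

Moving the probability of text away from 10/100 in either direction lowers the log likelihood of the observed counts.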
[MUSIC]

[MUSIC]
So now let's talk about the problem
a little bit more, and specifically let's
talk about the two different ways
of estimating the parameters.
One is called the Maximum Likelihood
estimate that I already just mentioned.
The other is Bayesian estimation.
So in maximum likelihood estimation,
we define best as
meaning the data likelihood
has reached the maximum.
So formally it's given
by this expression here,
where we define the estimate as an arg
max of the probability of X given theta.
So, arg max here just means it's
actually a function that will return
the argument that gives the function
its maximum value, as the value.
So the value of arg max is not
the value of this function,
but rather the argument that
makes the function reach its maximum.
So in this case the value
of arg max is theta.
It's the theta that makes the probability
of X, given theta, reach its maximum.
So this estimate intuitively also
makes sense and it's often very useful,
and it seeks the parameters
that best explain the data.
But it has a problem when the data
is too small, because when
there are very few data points,
the sample is small.
Then if we trust the data entirely and
try to fit the data,
we'll be biased.
So in the case of text data,
let's say our observed 100
words did not contain another
word related to text mining.
Now, our maximum likelihood estimator
will give that word a zero probability,
because giving it a non-zero probability
would take away probability
mass from some observed word,
which obviously is not optimal in
terms of maximizing the likelihood of
the observed data.
But this zero probability for
all the unseen words may not
be reasonable sometimes.
Especially, if we want the distribution
to characterize the topic of text mining.
So one way to address this problem is
actually to use Bayesian estimation,
where we actually would look
at both the data and
our prior knowledge about the parameters.
We assume that we have some prior
belief about the parameters.
Now in this case of course, so we are not
going to look at just the data,
but also look at the prior.
So the prior here is
defined by P of theta, and
this means we will impose some
preference on certain thetas over others.
And by using Bayes Rule,
that I have shown here,
we can then combine
the likelihood function
with the prior to give us this
posterior probability of the parameter.
Now, a full explanation of Bayes rule,
and some of these things
related to Bayesian reasoning,
would be outside the scope of this course.
But I just gave a brief
introduction because this is
general knowledge that
might be useful to you.
The Bayes Rule is basically defined here,
and
allows us to write down one
conditional probability of X
given Y in terms of the conditional
probability of Y given X.
And you can see the two probabilities
are different in the order
of the two variables.
But often the rule is used for
making inferences
of the variable, so
let's take a look at it again.
We can assume that p(X) encodes
our prior belief about X.
That means before we observe any other
data, that's our belief about X,
where we believe some X values have
higher probability than others.
And this probability of X given Y
is a conditional probability, and
this is our posterior belief about X.
Because this is our belief about X
values after we have observed the Y.
Given that we have observed the Y,
now what do we believe about X?
Now, do we believe some values have
higher probabilities than others?
Now the two probabilities
are related through this one,
this can be regarded as the probability of
the observed evidence Y,
given a particular X.
So you can think about X
as our hypothesis, and
we have some prior belief about
which hypothesis to choose.
And after we have observed Y,
we will update our belief and
this updating formula is based
on the combination of our prior
and the likelihood of observing
this Y if X is indeed true.
So much for the detour about Bayes Rule.
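For reference, the rule described in this detour can be written as follows, with X as the hypothesis and Y as the observed evidence. This is the standard statement of Bayes Rule; the slide itself is not reproduced in the transcript.

```latex
% Bayes Rule: posterior = likelihood * prior / evidence
p(X \mid Y) = \frac{p(Y \mid X)\, p(X)}{p(Y)}
\qquad\text{so that}\qquad
p(X \mid Y) \propto p(Y \mid X)\, p(X)
```

Here p(X) is the prior belief, p(Y | X) is the likelihood of the evidence under a particular X, and p(X | Y) is the posterior belief.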
In our case, what we are interested
in is inferring the theta values.
So, we have a prior here that encodes
our prior knowledge about the parameters.
And then we have the data likelihood here,
that would tell us which parameter
value can explain the data well.
The posterior probability
combines both of them,
so it represents a compromise
of the two preferences.
And in such a case, we can maximize
this posterior probability
to find the theta that would
maximize this posterior probability,
and this estimator is called a Maximum
a Posteriori, or MAP estimate.
And this estimator is
a more general estimator than
the maximum likelihood estimator.
Because if we define our prior
as a noninformative prior,
meaning that it's uniform
over all the theta values.
No preference, then we basically would go
back to the maximum likelihood estimate.
Because in such a case,
it's mainly going to be determined by
this likelihood value, the same as here.
But if we have an informative prior,
some bias towards
different values, then the MAP estimator
can allow us to incorporate that.
But the problem here of course,
is how to define the prior.
There is no free lunch and if you want to
solve the problem with more knowledge,
we have to have that knowledge.
And that knowledge,
ideally, should be reliable.
Otherwise, your estimate may not
necessarily be more accurate than that
maximum likelihood estimate.
So, now let's look at the Bayesian
estimation in more detail.
So, I show the theta values as just a one
dimensional value, and
that's a simplification of course.
And so, we're interested in which
value of theta is optimal.
So now, first we have the Prior.
The Prior tells us that
some of the values
are more likely than others, we believe.
For example, these values are more
likely than the values over here,
or here, or other places.
So this is our Prior, and
then we have our data likelihood.
And in this case, the data also tell us
which values of theta are more likely.
And that just means those values
can best explain our data.
And then when we combine the two
we get the posterior distribution,
and that's just a compromise of the two.
It would say that it's
somewhere in-between.
So, we can now look at some
interesting point estimates of theta.
This point represents the mode of the prior,
that means the most likely parameter
value according to our prior,
before we observe any data.
This point is the maximum
likelihood estimate,
it represents the theta that gives
the data the maximum probability.
Now this point is interesting,
it's the posterior mode.
It's the most likely value of theta
given by the posterior distribution.
And it represents a good
compromise of the prior mode and
the maximum likelihood estimate.
Now in general in Bayesian inference,
we are interested in
the distribution of all these
parameter values, as you see here.
There is a distribution over
theta values that you can see
here, P of theta given X.
So the problem of Bayesian inference is
to infer this posterior distribution, and
also to infer other interesting
quantities that might depend on theta.
So, I show f of theta here
as an interesting variable
that we want to compute.
But in order to compute this value,
we need to know the value of theta.
In Bayesian inference,
we treat theta as an uncertain variable.
So we think about all
the possible values of theta.
Therefore, we can estimate the value of
this function f as the expected value of f,
according to the posterior distribution
of theta, given the observed evidence X.
As a special case, we can assume f
of theta is just equal to theta.
In this case,
we get the expected value of the theta,
that's basically the posterior mean.
That gives us also one point of theta, and
it's sometimes the same as posterior mode,
but it's not always the same.
So, it gives us another way
to estimate the parameter.
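The picture described here (prior mode, maximum likelihood estimate, posterior mode, posterior mean) can be reproduced numerically on a grid of theta values. Everything specific in this sketch is an assumption made for illustration: theta is the probability of a single word, the data are 2 occurrences in 10 draws, and the prior shape is chosen to peak at 0.3.

```python
def argmax(values, grid):
    """Return the grid point at which the given values are largest."""
    best_index = max(range(len(grid)), key=lambda i: values[i])
    return grid[best_index]

# One-dimensional grid of candidate theta values: 0.01, 0.02, ..., 0.99.
grid = [i / 100 for i in range(1, 100)]

# Assumed prior (unnormalized), peaked at theta = 0.3.
prior = [t**3 * (1 - t) ** 7 for t in grid]

# Assumed data: the word occurred 2 times in 10 draws.
occurrences, trials = 2, 10
likelihood = [t**occurrences * (1 - t) ** (trials - occurrences) for t in grid]

# Bayes Rule on the grid: posterior is proportional to prior times likelihood.
unnormalized = [p * l for p, l in zip(prior, likelihood)]
normalizer = sum(unnormalized)
posterior = [u / normalizer for u in unnormalized]

prior_mode = argmax(prior, grid)
ml_estimate = argmax(likelihood, grid)
posterior_mode = argmax(posterior, grid)  # the MAP estimate
posterior_mean = sum(t * p for t, p in zip(grid, posterior))  # E[f(theta)] with f(theta) = theta

print(prior_mode, ml_estimate, posterior_mode, posterior_mean)
```

Under these assumptions the posterior mode falls between the maximum likelihood estimate and the prior mode, and the posterior mean is close to, but not equal to, the posterior mode.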
So, this is a general illustration of
Bayesian estimation and inference.
And later,
you will see this can be useful for
topic mining where we want to inject
some prior knowledge about the topics.
So to summarize,
we've introduced the language model,
which is basically a probability
distribution over text.
It's also called a generative model for
text data.
The simplest language model
is Unigram Language Model,
it's basically a word distribution.
We introduced the concept
of likelihood function,
which is the probability of
the data given some model.
And this function is very important,
given a particular set of parameter
values this function can tell us which X,
which data point has a higher likelihood,
higher probability.
Given a data sample X,
we can use this function to determine
which parameter values would maximize
the probability of the observed data,
and this is the maximum
likelihood estimate.
We also talked about Bayesian
estimation or inference.
In this case, we must define a prior
on the parameters, p of theta.
And then we're interested in computing the
posterior distribution of the parameters,
which is proportional to the product of
the prior and the likelihood.
And this distribution would allow us then
to infer any values derived from theta.
[MUSIC]

[SOUND] This lecture is a continued
discussion of probabilistic topic models.
In this lecture, we're going to continue
discussing probabilistic models.
We're going to talk about
a very simple case where we
are interested in just mining
one topic from one document.
So in this simple setup,
we are interested in analyzing
one document and
trying to discover just one topic.
So this is the simplest
case of topic model.
The input now no longer has k,
which is the number of topics because we
know there is only one topic and the
collection has only one document, also.
In the output,
we also no longer have coverage because
we assumed that the document
covers this topic 100%.
So the main goal is just to discover
the word probabilities for
this single topic, as shown here.
As always, when we think about using a
generative model to solve such a problem,
we start with thinking about what
kind of data we are going to model or
from what perspective we're going to
model the data or data representation.
And then we're going to
design a specific model for
the generation of the data,
from our perspective.
Where our perspective just means we want
to take a particular angle of looking at
the data, so that the model will
have the right parameters for
discovering the knowledge that we want.
And then we'll be thinking
about the likelihood function, or
write down the likelihood function to
capture more formally how likely
a data point will be
obtained from this model.
And the likelihood function will have
some parameters in the function.
And then we are interested in
estimating those parameters, for example,
by maximizing the likelihood, which will
lead to the maximum likelihood estimator.
These estimated parameters
will then become the output
of the mining algorithm,
which means we'll take the estimated
parameters as the knowledge
that we discover from the text.
So let's look at these steps for
this very simple case.
Later we'll look at this procedure for
some more complicated cases.
So our data, in this case is, just
a document which is a sequence of words.
Each word here is denoted by x sub i.
Our model is a Unigram language model.
A word distribution that we hope to
denote a topic and that's our goal.
So we will have as many parameters as
words in our vocabulary, in this case M.
And for convenience we're
going to use theta sub i to
denote the probability of word w sub i.
And obviously these theta
sub i's will sum to 1.
Now what does a likelihood
function look like?
Well, this is just the probability
of generating this whole document,
given such a model.
Because we assume independence in
generating each word, the probability of
the document will be just a product
of the probability of each word.
And since some words might
have repeated occurrences,
we can also rewrite this
product in a different form.
So in this line, we have rewritten
the formula into a product
over all the unique words in
the vocabulary, w sub 1 through w sub M.
Now this is different
from the previous line.
where the product is over different
positions of words in the document.
Now when we do this transformation,
we then would need to
introduce a count function here.
This denotes the count of
word one in the document, and
similarly this is the count
of word M in the document,
because these words might
have repeated occurrences.
You can also see that if a word did
not occur in the document,
it will have a zero count, therefore
that corresponding term will disappear.
So this is a very useful form of
writing down the likelihood function
that we will often use later.
So I want you to pay attention to this,
just get familiar with this notation.
It's just a change of the product to one over all
the different words in the vocabulary.
So in the end, of course, we'll use
theta sub i to express this likelihood
function and it would look like this.
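The slide formulas are not in the transcript, but the rewriting described above can be reconstructed as follows. The notation is an assumption: d = x_1 ... x_{|d|} is the document, w_1, ..., w_M are the vocabulary words, and c(w_i, d) is the count of word w_i in d.

```latex
% Likelihood of document d under a unigram model theta
p(d \mid \theta)
  = \prod_{j=1}^{|d|} p(x_j \mid \theta)            % product over positions
  = \prod_{i=1}^{M} p(w_i \mid \theta)^{c(w_i, d)}  % product over unique words
  = \prod_{i=1}^{M} \theta_i^{\,c(w_i, d)},
\qquad \sum_{i=1}^{M} \theta_i = 1
```

A word with c(w_i, d) = 0 contributes a factor of 1, which is why its term disappears.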
Next, we're going to find
the theta values or probabilities
of these words that would maximize
this likelihood function.
So now lets take a look at the maximum
likelihood estimate problem more closely.
This line is copied from
the previous slide.
It's just our likelihood function.
So our goal is to maximize
this likelihood function.
We will often find it easier to
maximize the log-likelihood
instead of the original likelihood.
And this is purely for
mathematical convenience, because after
the logarithm transformation our function
will become a sum instead of a product.
And we also have constraints
over these probabilities.
The sum makes it easier to take
derivative, which is often needed for
finding the optimal
solution of this function.
So please take a look at this sum again,
here.
And this is a form of
a function that you will often
see later also, in
the more general topic models.
So it's a sum over all
the words in the vocabulary.
And inside the sum there is
a count of a word in the document.
And this is multiplied by
the logarithm of a probability.
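This sum of count times log probability is easy to compute. The sketch below uses a made-up three-word document and made-up theta values, and checks that summing over unique words with counts gives the same value as summing over word positions.

```python
import math
from collections import Counter

def log_likelihood(document_words, theta):
    """Sum over unique words of count(word) * log(theta[word])."""
    counts = Counter(document_words)
    # Words with zero count contribute nothing, so only observed words are visited.
    return sum(count * math.log(theta[word]) for word, count in counts.items())

# Hypothetical document and word probabilities.
example_document = ["text", "mining", "text"]
example_theta = {"text": 0.6, "mining": 0.4}

by_counts = log_likelihood(example_document, example_theta)
by_positions = sum(math.log(example_theta[w]) for w in example_document)
print(by_counts, by_positions)
```

The two values agree up to floating point rounding.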
So let's see how we can
solve this problem.
Now at this point the problem is purely a
mathematical problem because we are going
to just find the optimal solution
of a constrained maximization problem.
The objective function is
the likelihood function and
the constraint is that all these
probabilities must sum to one.
So, one way to solve the problem is
to use the Lagrange multiplier approach.
Now this content is beyond
the scope of this course but
since Lagrange multiplier is a very
useful approach, I also would like
to just give a brief introduction to this,
for those of you who are interested.
So in this approach we will
construct a Lagrange function, here.
And this function will combine
our objective function
with another term that
encodes our constraint and
we introduce Lagrange multiplier here,
lambda, so it's an additional parameter.
Now, the idea of this approach is just to
turn the constrained optimization into,
in some sense,
an unconstrained optimization problem.
Now we are just interested in
optimizing this Lagrange function.
As you may recall from calculus,
an optimal point
would be achieved when
the derivative is set to zero.
This is a necessary condition.
It's not sufficient, though.
So if we do that you will
see the partial derivative,
with respect to theta i
here, is equal to this.
And this part comes from the derivative
of the logarithm function and
this lambda is simply taken from here.
And when we set it to zero we can
easily see theta sub i is
related to lambda in this way.
Since we know all the theta
i's must sum to one,
we can plug this into this constraint,
here.
And this will allow us to solve for
lambda.
And this is just the negative
sum of all the counts.
And this further allows us to then
solve the optimization problem,
eventually, to find the optimal
setting for theta sub i.
And if you look at this formula it turns
out that it's actually very intuitive
because this is just the normalized
count of these words by the document length,
which is also a sum of all
the counts of words in the document.
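The derivation walked through above can be written out step by step. The slide is not in the transcript, so the sign convention for the constraint term is an assumption; it is chosen so that lambda comes out as minus the total count. As before, c(w_i, d) denotes the count of word w_i in d and |d| the document length.

```latex
% Lagrange function: log-likelihood plus the constraint term
f(\theta \mid d) = \sum_{i=1}^{M} c(w_i, d)\,\log\theta_i
                 + \lambda\Bigl(\sum_{i=1}^{M}\theta_i - 1\Bigr)

% Set the partial derivative to zero
\frac{\partial f}{\partial \theta_i} = \frac{c(w_i, d)}{\theta_i} + \lambda = 0
\;\Longrightarrow\; \theta_i = -\frac{c(w_i, d)}{\lambda}

% Plug into the constraint to solve for lambda
\sum_{i=1}^{M}\theta_i = 1
\;\Longrightarrow\; \lambda = -\sum_{i=1}^{M} c(w_i, d)

% Maximum likelihood estimate
\hat{\theta}_i = \frac{c(w_i, d)}{\sum_{j=1}^{M} c(w_j, d)} = \frac{c(w_i, d)}{|d|}
```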
So, after all this math,
we have just obtained something
that's very intuitive, and
this will be just our
intuition where we want to
maximize the probability of the data by
assigning as much probability
mass as possible to all
the observed words here.
And you might also notice that this is
the general result of the maximum likelihood
estimator.
In general, the estimator would be to
normalize counts, and it's just that sometimes
the counts have to be done in a particular
way, as you will also see later.
So this is basically an analytical
solution to our optimization problem.
In general though, when the likelihood
function is very complicated, we're not
going to be able to solve the optimization
problem by having a closed form formula.
Instead we have to use some
numerical algorithms and
we're going to see such cases later, also.
So if you imagine what would we
get if we use such a maximum
likelihood estimator to estimate one
topic for a single document d here?
Let's imagine this document
is a text mining paper.
Now, what you might see is
something that looks like this.
On the top, you will see the high
probability words tend to be those very
common words,
often functional words in English.
And this will be followed by
some content words that really
characterize the topic well like text,
mining, etc.
And then in the end,
you also see small probabilities on
words that are not really
related to the topic but
they might be extraneously
mentioned in the document.
As a topic representation,
you will see this is not ideal, right?
That's because the high probability
words are functional words,
they are not really
characterizing the topic.
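The effect is easy to reproduce on a toy example. The short "document" below is invented for illustration (it is not from the lecture), and the estimate is simply the normalized word count.

```python
from collections import Counter

# Hypothetical miniature document standing in for a text mining paper.
document = (
    "the text mining algorithm is applied to the text of the paper "
    "and the mining of the text is the topic of the paper"
).split()

counts = Counter(document)
theta = {word: count / len(document) for word, count in counts.items()}

# Words sorted by estimated probability, highest first.
ranked = sorted(theta.items(), key=lambda item: item[1], reverse=True)
for word, probability in ranked[:5]:
    print(f"{word:8s} {probability:.3f}")
```

Even in this tiny example the function word "the" receives the highest probability, ahead of content words such as "text" and "mining".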
So my question is how can we
get rid of such common words?
Now this is the topic of the next module.
We're going to talk about how to use
probabilistic models to somehow get rid of
these common words.
[MUSIC]

[MUSIC]
This lecture is about the mixture
of unigram language models.
In this lecture we will continue
discussing probabilistic topic models.
In particular, what we introduce
a mixture of unigram language models.
This is a slide that
you have seen earlier.
where we talked about how to
get rid of the background
words that we have on top for
one document.
So if you want to solve the problem,
it would be useful to think about
why we end up having this problem.
Well, this is obviously because these
words are very frequent in our data and
we are using a maximum
likelihood estimate.
Then the estimate obviously would
have to assign high probability to
these words in order to
maximize the likelihood.
So, in order to get rid of them that
would mean we'd have to do something
differently here.
In particular we'll have
to say this distribution
doesn't have to explain all
the words in the text data.
What we're going to say is that
these common words should not be
explained by this distribution.
So one natural way to solve the problem is
to think about using another distribution
to account for just these common words.
This way, the two distributions can be
mixed together to generate the text data.
And we'll let the other model, which
we'll call the background topic model,
generate the common words.
This way our target topic theta
here will be only generating
the content words that
characterize the content of the document.
So, how does this work?
Well, it is just a small
modification of the previous setup
where we have just one distribution.
Since we now have two distributions,
we have to decide which distribution
to use when we generate the word.
Each word will still be a sample
from one of the two distributions.
Text data is still
generated the same way.
Namely, we generate
one word at a time and
eventually we generate a lot of words.
When we generate the word,
however, we're going to first decide
which of the two distributions to use.
And this is controlled by another
probability, the probability of
theta sub d and
the probability of theta sub B here.
So this is the probability of selecting
the topic word distribution.
This is the probability of
selecting the background word
distribution denoted by theta sub B.
In this case I just give an example
where we can set both to 0.5.
So you're going to basically flip a coin,
a fair coin,
to decide what you want to use.
But in general these probabilities
don't have to be equal.
So you might bias toward using
one topic more than the other.
So now the process of generating a word
would be to first flip a coin
based on these probabilities of choosing
each model. If, let's say, the coin
shows up as heads, which means we're going
to use the topic word distribution,
then we're going to use this word
distribution to generate a word.
Otherwise we would be
going down the other path,
and we're going to use the background
word distribution to generate a word.
So in such a case,
we have a model that has some uncertainty
associated with the use
of a word distribution.
But we can still think of this as
a model for generating text data.
And such a model is
called a mixture model.
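The generation process just described can be sketched in a few lines of Python. This is only a minimal illustration, not the lecture's own code: the two toy distributions, their numbers, and the names `theta_d`, `theta_B`, `p_theta_d` and `generate_word` are all assumptions made up for the example.

```python
import random

# Hypothetical toy distributions; the numbers are illustrative only.
theta_d = {"text": 0.5, "mining": 0.4, "the": 0.1}  # topic word distribution
theta_B = {"the": 0.7, "is": 0.2, "text": 0.1}      # background word distribution
p_theta_d = 0.5  # probability of choosing the topic model (a fair coin here)

def generate_word(rng):
    """Flip a coin to choose a component, then sample one word from it."""
    dist = theta_d if rng.random() < p_theta_d else theta_B
    words = list(dist)
    weights = [dist[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
document = [generate_word(rng) for _ in range(10)]
print(document)
```

Changing `p_theta_d` away from 0.5 corresponds to biasing the coin toward one of the two models.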
So now let's see.
In this case, what's the probability
of observing a word w?
Now here I showed some words,
like "the" and "text".
So as in all cases,
once we setup a model we are interested
in computing the likelihood function.
The basic question is, so
what's the probability of
observing a specific word here?
Now we know that the word can be observed
from each of the two distributions, so
we have to consider two cases.
Therefore it's a sum over these two cases.
The first case is to use the topic
word distribution to generate the word.
And in such a case
the probability would be p of theta sub d,
which is the probability
of choosing the model
multiplied by the probability of actually
observing the word from that model.
Both events must happen
in order to observe the word.
We must first have chosen
the topic theta sub d, and then
we also have to actually have sampled
the word "the" from the distribution.
And similarly,
the second part accounts for
a different way of generating
the word from the background.
Now obviously the probability of
"text" is similar, right?
So we also can see the two
ways of generating "text".
And in each case, it's a product of the
probability of choosing a particular
distribution multiplied by the probability of
observing the word from that distribution.
Now, as you will see later,
this is actually a general form.
So you might want to make sure that you have
really understood this expression here.
And you should convince yourself that
this is indeed the probability of
observing "text".
So to summarize what we observed here.
The probability of a word from
a mixture model is a general sum
of different ways of generating the word.
In each case,
it's a product of the probability
of selecting that component model.
Multiplied by the probability of
actually observing the data point
from that component of the model.
And this is something quite general and
you will see this occurring often later.
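As a small worked sketch of this sum of products, the following Python computes the probability of a word under a mixture. The function name `mixture_word_prob` and the numbers in the two distributions are assumptions chosen only for illustration.

```python
def mixture_word_prob(word, components):
    """Sum over components of p(component) * p(word | component)."""
    return sum(weight * dist.get(word, 0.0) for weight, dist in components)

# Hypothetical two-word vocabulary and probabilities.
theta_d = {"text": 0.6, "the": 0.4}  # topic word distribution
theta_B = {"text": 0.1, "the": 0.9}  # background word distribution
components = [(0.5, theta_d), (0.5, theta_B)]

p_the = mixture_word_prob("the", components)    # 0.5 * 0.4 + 0.5 * 0.9
p_text = mixture_word_prob("text", components)  # 0.5 * 0.6 + 0.5 * 0.1
print(p_the, p_text)
```

Because each component distribution sums to one and the selection probabilities sum to one, the mixture probabilities over the vocabulary also sum to one.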
So the basic idea of a mixture
model is just to treat
these two distributions
together as one model.
So I used a box to bring all
these components together.
So if you view this
whole box as one model,
it's just like any other generative model.
It would just give us
the probability of a word.
But the way that it determines this
probability is quite different from
when we have just one distribution.
And this is basically a more
complicated model,
one with more
than just one distribution.
And it's called a mixture model.
So as I just said we can treat
this as a generative model.
And it's often useful to think of it
just as a likelihood function.
The illustration that
you have seen before,
which is dimmer now, is just
the illustration of this generative model.
So mathematically,
this model is nothing but
to just define the following
generative model.
Where the probability of a word is
assumed to be a sum over two cases
of generating the word.
And the form you are seeing now
is a more general form than
what you have seen in
the calculation earlier.
Here I just use the symbol
w to denote any word, but
you can still see this is
basically first a sum.
Right?
And this sum is due to the fact that the
word can be generated in multiple ways,
two ways in this case.
And inside a sum,
each term is a product of two terms.
And the two terms are: first,
the probability of selecting a component
like theta sub d; second,
the probability of actually observing
the word from this component of the model.
So this is a very general description
of all the mixture models.
I just want to make sure
that you understand
this because this is really the basis for
understanding all kinds of topic models.
So now, once we set up the model,
we can write down the likelihood
function as we see here.
The next question is,
how can we estimate the parameters,
or what to do with the parameters,
given the data?
Well, in general,
we can use some of the text data
to estimate the model parameters.
And this estimation would allow us to
discover the interesting
knowledge about the text.
So, in this case, what do we discover?
Well, these are represented
by our parameters, and
we will have two kinds of parameters.
One is the two word distributions,
which represent topics, and
the other is the coverage
of each topic.
And this is determined by the
probability of theta sub d and the
probability of theta sub B, which sum to one.
Now, what's interesting is
also to think about special
cases, like when we set one of
them to one. What would happen?
Well, the other would be zero, right?
And if you look at
the likelihood function,
it will then degenerate to the special
case of just one distribution.
Okay so you can easily verify that by
assuming one of these two is 1.0 and
the other is Zero.
So in this sense,
the mixture model is more general than
the previous model where we
have just one distribution.
It can cover that as a special case.
So to summarize, we talked about the
mixture of two Unigram Language Models and
the data we're considering
here is just One document.
And the model is a mixture
model with two components,
two unigram language models,
specifically theta sub d,
which is intended to denote the topic of
document d, and theta sub B, which is
representing a background topic that
we can set to attract the common
words because common words would be
assigned a high probability in this model.
So the parameters can
be collectively called
Lambda, which I show here. You can again
think about the question of how many
parameters we are talking about exactly.
This is usually a good exercise to do
because it allows you to see the model in
depth and to have a complete understanding
of what's going on in this model.
And we have mixing weights,
of course, also.
So what does a likelihood
function look like?
Well, it looks very similar
to what we had before.
So for the document,
first it's a product over all the words in
the document exactly the same as before.
The only difference is that inside here
now it's a sum instead of just one term.
So you might recall that before
we just had this one term there.
But now we have this sum
because of the mixture model.
And because of the mixture model we
also have to introduce a probability of
choosing that particular
component of distribution.
And so
this is just another way of writing it,
by using a product over all the unique
words in our vocabulary instead of
having that product over all
the positions in the document.
And this form, where we look at
the different unique words, is
a convenient form for computing
the maximum likelihood estimate later.
And the maximum likelihood estimator is,
as usual,
just to find the parameters that would
maximize the likelihood function.
And the constraints here
are of course two kinds.
One is that the word probabilities in each
distribution must sum to 1; the other is
that the probabilities of choosing each
component must sum to 1.
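Written out, the likelihood function and constraints described above would look roughly as follows. The notation is an assumption for illustration: x_i is the word at position i of document d, V is the vocabulary, c(w,d) is the count of word w in d, and Lambda collects all the parameters.

```latex
\begin{align*}
p(d \mid \Lambda)
  &= \prod_{i=1}^{|d|} \Big[ p(\theta_d)\, p(x_i \mid \theta_d)
       + p(\theta_B)\, p(x_i \mid \theta_B) \Big] \\
  &= \prod_{w \in V} \Big[ p(\theta_d)\, p(w \mid \theta_d)
       + p(\theta_B)\, p(w \mid \theta_B) \Big]^{c(w,d)} \\[4pt]
\text{subject to}\quad
  & \sum_{w \in V} p(w \mid \theta_d) = 1, \qquad
    \sum_{w \in V} p(w \mid \theta_B) = 1, \qquad
    p(\theta_d) + p(\theta_B) = 1 .
\end{align*}
```

The second line simply groups the positions that hold the same word, which is why the count c(w,d) appears as an exponent.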
[MUSIC]

[SOUND]
This lecture is about mixture model estimation.
In this lecture we're going to continue
discussing probabilistic topic models.
In particular,
we're going to talk about how to estimate
the parameters of a mixture model.
So let's first look at our motivation for
using a mixture model.
We hope to factor out
the background words
from the topic word distribution.
The idea is to assume that the text data
actually contained two kinds of words.
One kind is from the background here.
So, "the", "is", "we", etc.
And the other kind is from our topic word
distribution that we are interested in.
So in order to solve this problem
of factoring out background words,
we can set up our mixture model as follows.
We're going to assume that we
already know the values of
all the parameters in the mixture model,
except for the word distribution
theta sub d, which is our target.
So this is a case of customizing
a probabilistic model so
that we embed an unknown variable
that we are interested in.
But we're going to simplify other things.
We're going to assume we
have knowledge about the others.
And this is a powerful way
of customizing a model
for a particular need.
Now you can imagine,
we could have assumed that we also
don't know the background words.
But in this case,
our goal is to factor out precisely
those high probability background words.
So we assume the background
model is already fixed.
And the problem here is how
can we adjust theta sub d
in order to maximize the probability
of the observed document here,
assuming all the other
parameters are known.
Now although we designed
the model heuristically
to try to factor out
these background words,
it's unclear whether,
if we use the maximum likelihood estimator,
we will actually end up having
a word distribution where the common
words like "the" would indeed have
smaller probabilities than before.
Now in this case it turns
out the answer is yes.
When we set up
the probabilistic model in this way
and use the maximum likelihood estimator,
we will end up having a word distribution
where the common words
would be factored out
by the use of the background
word distribution.
So to understand why this is so,
it's useful to examine
the behavior of a mixture model.
So we're going to look at
a very, very simple case
in order to understand some interesting
behaviors of a mixture model.
The patterns observed here are actually
generalizable to mixture models in general.
But it's much easier to
understand this behavior
when we use a very simple case
like what we are seeing here.
So specifically in this case,
let's assume that
the probability of choosing each of
the two models is exactly the same.
So we're going to flip a fair coin
to decide which model to use.
Furthermore, we're going
to assume there are
precisely two words, "the" and "text".
Obviously this is a very naive
oversimplification of the actual text,
but again, it's useful to examine
the behavior in such a special case.
So we further assume that the background
model gives a probability of
0.9 to the word "the" and 0.1 to "text".
Now, let's also assume that our data is
extremely simple: the document has just
two words, "text" and "the". So now let's write
down the likelihood function in such a case.
First, what's the probability of "text",
and what's the probability of "the"?
I hope by this point you'll
be able to write it down.
So the probability of "text" is
basically the sum over two cases,
where each case corresponds
to one of the word distributions
and it accounts for
the two ways of generating "text".
And inside each case, we have
the probability of choosing the model,
which is 0.5 multiplied by the probability
of observing text from that model.
Similarly, "the"
would have a probability of the same form,
just what is different is
the exact probabilities.
So naturally our likelihood function
is just a product of the two.
So it's very easy to see that,
once you understand what's
the probability of each word.
Which is also why it's so
important to understand what's exactly
the probability of observing each
word from such a mixture model.
Now, the interesting question now is,
how can we then optimize this likelihood?
Well, you will note that
there are only two variables.
They are precisely the two
probabilities of the two words
"text" and "the" given by theta sub d.
And this is because we have assumed
that all the other parameters are known.
So, now the question is a very
simple algebra question.
So, we have a simple expression
with two variables and
we hope to choose the values of these
two variables to maximize this function.
It is like the exercises that we have
seen in some simple algebra problems.
Note that the two probabilities must
sum to one, so there's some constraint.
If there were no constraint, of course,
we would set both probabilities to
their maximum value, which would be one,
to maximize the function. But we can't do that
because "text" and "the" must sum to one.
We can't give both a probability of one.
So, now the question is how should
we allocate the probability
mass between the two words.
What do you think?
Now, it would be useful to look
at this formula for a moment, and
to see, intuitively,
what we should do in order to
set these probabilities to
maximize the value of this function.
Okay, if we look into this further,
then we see some interesting behavior
of the two component models in that
they will be collaborating to maximize
the probability of the observed data.
Which is dictated by the maximum
likelihood estimator.
But they are also competing in some way,
and in particular,
they would be competing on the words.
And they would tend to bet high
probabilities on different words
to avoid this competition in some sense or
to gain advantages in this competition.
So again,
looking at this objective function and
we have a constraint on
the two probabilities.
Now, if you look at
the formula intuitively,
you might feel that you want to set the
probability of text to be somewhat larger.
And this intuition can be well supported
by a mathematical fact, which is that when
the sum of two variables is
a constant, then the product of them
is maximized when they are equal,
and this is a fact we know from algebra.
Now if we plug that in, it
would mean that we have to make the two
probabilities equal.
And when we make them equal and
then consider the constraint, it
will be easy to solve this problem, and
the solution is that the probability of "text"
will be 0.9 and the probability of "the" is 0.1.
The probability of "text" is now much
larger than the probability of "the", and
this is not the case when we
have just one distribution.
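The solution for the two-word example can be checked numerically. The following is a minimal sketch, assuming the setup stated in the lecture (a fair coin, a background model with 0.9 for "the" and 0.1 for "text", and a document containing one "text" and one "the"); the function name `likelihood` and the brute-force grid search are choices made here for illustration, not the lecture's method.

```python
def likelihood(p_text, p_topic=0.5, bg_the=0.9, bg_text=0.1):
    """Likelihood of the document {"text", "the"} as a function of p(text | theta_d)."""
    p_the = 1.0 - p_text  # the two topic probabilities must sum to one
    prob_text = p_topic * p_text + (1 - p_topic) * bg_text
    prob_the = p_topic * p_the + (1 - p_topic) * bg_the
    return prob_text * prob_the

# Try every value of p(text | theta_d) on a fine grid and keep the best one.
grid = [i / 1000 for i in range(1001)]
best_p_text = max(grid, key=likelihood)
print(best_p_text, 1.0 - best_p_text)
```

At the maximizing value the two factors of the likelihood are equal, which is the algebra fact used in the lecture.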
And this is clearly because of
the use of the background model,
which assigned the very high probability
to the and low probability to text.
And if you look at the equation
you will see obviously
some interaction of the two
distributions here.
In particular,
you will see that in order to make them equal,
the probability assigned
by theta sub d must be higher for
a word that has a smaller
probability given by the background.
This is obvious from
examining this equation.
Because the background part is weak for
text.
It's small.
So in order to compensate for that,
we must make the probability for
text given by theta sub D somewhat larger,
so that the two sides can be balanced.
So this is in fact a very
general behavior of this model.
And that is, if one distribution assigns a
higher probability to one word than another,
then the other distribution
would tend to do the opposite.
Basically it would discourage other
distributions from doing the same, and
this is to balance them out so
we can account for all kinds of words.
And this also means that by using
a background model that is fixed to
assign high probabilities
to background words,
we can indeed encourage the unknown
topic word distribution to assign smaller
probabilities to such common words
and instead put more probability
mass on the content words
that cannot be explained well
by the background model,
meaning that they have a very small
probability from the background model, like
"text" here.
[MUSIC]

[SOUND] Now let's look at another
behavior of the mixture model, and
in this case let's look at
the response to data frequencies.
So what you are seeing now is basically
the likelihood function for
the two-word document, and
we know in this case the solution is "text"
with a probability of 0.9 and
"the" with a probability of 0.1.
Now it's interesting to
think about a scenario where we start
adding more words to the document.
So what would happen if we add
many the's to the document?
Now this would change the game, right?
So, how?
Well, picture, what would
the likelihood function look like now?
Well, it starts with the likelihood
function for the two words, right?
As we add more words, we know that
we have to just multiply
the likelihood function by
additional terms to account for
the additional
occurrences of "the".
Since in this case,
all the additional terms are the,
we're going to just multiply by this term.
Right?
For the probability of the.
And if we have another occurrence of the,
we'd multiply again by the same term,
and so on and so forth.
Add as many terms as the number of
the's that we add to the document, d'.
Now this obviously changes
the likelihood function.
So what's interesting is now to think
about how would that change our solution?
So what's the optimal solution now?
Now, intuitively you'd know
the original solution,
0.9 versus 0.1, will no
longer be optimal for this new function.
Right?
But the question is how
should we change it.
The constraint is that they must sum to one.
So we know we must take away some
probability mass from one word and
add the probability
mass to the other word.
The question is which word should
have a reduced probability and
which word should have a larger probability.
And in particular,
let's think about the probability of the.
Should it be increased
to be more than 0.1?
Or should we decrease it to less than 0.1?
What do you think?
Now you might want to pause the video
a moment to think more about
this question,
because this has to do with understanding
an important behavior of a mixture model,
and indeed,
of the maximum likelihood estimator.
Now if you look at the formula for
a moment, then you will see it seems like
now the objective function is more
influenced by "the" than "text".
Before, each contributed one term.
So now as you can imagine,
it would make sense to actually
assign a smaller probability to
"text"
to make room for
a larger probability for "the".
Why?
Because the is repeated many times.
If we increase it a little bit,
it will have more positive impact.
Whereas a slight decrease of "text"
will have relatively small impact
because it occurred just once, right?
So this means there is another
behavior that we observe here.
That is, high frequency words tend
to get high probabilities
from all the distributions.
And, this is no surprise at all,
because after all, we are maximizing
the likelihood of the data.
So the more a word occurs, then it
makes more sense to give such a word
a higher probability because the impact
would be more on the likelihood function.
This is in fact a very general phenomenon
of the maximum likelihood estimator.
But in this case, we can see as we
see more occurrences of a term,
it also encourages the unknown
distribution theta sub d
to assign a somewhat higher
probability to this word.
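This response to data frequency can be sketched numerically. The code below is only an illustration under the lecture's toy setup (a fair coin, background probabilities 0.9 for "the" and 0.1 for "text", and a document with one "text" plus a growing number of "the"); the name `best_p_text` and the grid search over the log-likelihood are assumptions of this sketch.

```python
import math

def best_p_text(n_the, p_topic=0.5, bg_the=0.9, bg_text=0.1, steps=1000):
    """Grid-search p(text | theta_d) for a document with one "text" and n_the copies of "the"."""
    def log_likelihood(p_text):
        p_the = 1.0 - p_text
        prob_text = p_topic * p_text + (1 - p_topic) * bg_text
        prob_the = p_topic * p_the + (1 - p_topic) * bg_the
        # Each extra "the" multiplies the likelihood by prob_the once more.
        return math.log(prob_text) + n_the * math.log(prob_the)

    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=log_likelihood)

for n_the in (1, 2, 5, 10):
    p_text = best_p_text(n_the)
    print(n_the, "p(the | theta_d) ~", round(1.0 - p_text, 3))
```

As the number of occurrences of "the" grows, the estimated p(the | theta_d) moves above 0.1, even though the background model already gives "the" a high probability.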
Now it's also interesting to think about
the impact of probability of Theta sub B.
The probability of choosing one
of the two component models.
Now we've been so far assuming
that each model is equally likely.
And that gives us 0.5.
But you can again look at this likelihood
function and try to picture what would
happen if we increase the probability
of choosing a background model.
Now you will see that the term for "the"
would have a different form where
the probability would be
even larger, because the background has
a high probability for the word and
the coefficient in front of 0.9, which
is now 0.5, would be even larger.
When this is larger,
the overall result would be larger.
And that also makes it
less important for
theta sub d to increase
the probability for "the",
because it's already very large.
So the impact here of increasing
the probability of "the" is somewhat
regulated by this coefficient,
the 0.5.
If it's larger on the background,
then it becomes less important
to increase the value.
So this means the behavior here,
which is that high frequency words tend to get
high probabilities, is affected or
regularized somewhat by the probability
of choosing each component.
The more likely a component
is to be chosen,
the more important it is to have higher
values for these frequent words.
If a component has a very small probability of
being chosen, then the incentive is less.
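The regulating effect of the component selection probability can also be sketched numerically. This is a hedged illustration, not the lecture's code: it assumes a document with one "text" and five occurrences of "the" (the count five is made up), the same background model as before, and a hypothetical helper `best_p_the` that grid-searches p(the | theta_d) for a given probability of choosing the background.

```python
import math

def best_p_the(p_background, n_the=5, bg_the=0.9, bg_text=0.1, steps=1000):
    """Grid-search p(the | theta_d) when the background is chosen with p_background."""
    p_topic = 1.0 - p_background

    def log_likelihood(p_the):
        p_text = 1.0 - p_the
        prob_text = p_topic * p_text + p_background * bg_text
        prob_the = p_topic * p_the + p_background * bg_the
        return math.log(prob_text) + n_the * math.log(prob_the)

    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=log_likelihood)

for p_background in (0.2, 0.5, 0.8):
    print(p_background, "p(the | theta_d) ~", round(best_p_the(p_background), 3))
```

In this toy setting, the larger the probability of choosing the background model, the smaller the probability that theta sub d ends up assigning to "the".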
So to summarize,
we have just discussed the mixture model.
And we discussed the estimation
problem of the mixture model, and
in particular we discussed some
general behavior of the estimator,
which means we can expect our
estimator to capture these intuitions.
First, every component model
attempts to assign high probabilities to
highly frequent words in the data.
And this is to collaboratively
maximize likelihood.
Second, different component models tend to
bet high probabilities on different words.
And this is to avoid a competition or
waste of probability.
And this would allow them to collaborate
more efficiently to maximize
the likelihood.
So, the probability of choosing each
component regulates the collaboration and
the competition between component models.
It would allow some component models
to respond more to the change,
for example, of the frequency of
a data point in the data.
We also talked about the special case
of fixing one component to a background
word distribution, right?
And this distribution can be estimated
by using a collection of documents,
a large collection of English documents,
by using just one distribution and
then we'll just have normalized
frequencies of terms to
give us the probabilities
of all these words.
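Estimating the background distribution from normalized term frequencies can be sketched as follows. The three-sentence collection, the whitespace tokenization, and the name `estimate_background` are assumptions for illustration; a real background model would use a large collection of English documents.

```python
from collections import Counter

def estimate_background(documents):
    """Maximum likelihood unigram model: count of each word divided by total word count."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical tiny collection standing in for a large English corpus.
collection = [
    "the cat sat on the mat",
    "the dog is on the log",
    "text mining is fun",
]
theta_B = estimate_background(collection)
print(sorted(theta_B.items(), key=lambda item: -item[1])[:3])
```

Even in this tiny collection, the most frequent word is "the", which is the kind of word the background model is meant to absorb.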
Now when we use such
a specialized mixture model,
we showed that we can effectively get rid
of the background words in the other component.
And that would make the discovered
topic more discriminative.
This is also an example of imposing
a prior on the model parameters, and
the prior here basically means one model
must be exactly the same as the background
language model. If you recall what we
talked about in Bayesian estimation,
this prior will allow us to favor a model
that is consistent with our prior.
In fact, if it's not consistent we're
going to say the model is impossible.
So it has a zero prior probability.
That effectively excludes such a scenario.
This is also an issue that
we'll talk more about later.
[MUSIC]














[SOUND] This lecture is about
opinion mining and sentiment analysis,
covering motivation.
In this lecture,
we're going to start talking about
mining a different kind of knowledge.
Namely, knowledge about the observer or
humans that have generated the text data.
In particular, we're going to talk about
the opinion mining and sentiment analysis.
As we discussed earlier, text data
can be regarded as data generated
from humans as subjective sensors.
In contrast, we have other devices such
as video recorders that can report what's
happening in the real world objectively to
generate video data, for example.
Now the main difference between text
data and other data, like video data,
is that it has rich opinions,
and the content tends to be subjective
because it's generated by humans.
Now, this is actually a unique advantage
of text data, as compared with other data,
because it offers a great
opportunity to understand the observers.
We can mine text data to
understand their opinions.
Understand people's preferences,
how people think about something.
So this lecture and the following lectures
will be mainly about how we can mine and
analyze opinions buried
in a lot of text data.
So let's start with
the concept of opinion.
It's not that easy to
formally define opinion, but
mostly we would define
opinion as a subjective
statement describing what a person
believes or thinks about something.
Now, I highlighted quite a few words here.
And that's because it's worth thinking
a little bit more about these words.
And that will help us better
understand what's in an opinion.
And this further helps us to
define opinion more formally.
which is always needed to computationally
solve the problem of opinion mining.
So let's first look at the key
word of subjective here.
This is in contrast with objective
statement or factual statement.
Those statements can be proved right or
wrong.
And this is a key differentiating
factor from opinions,
which tend to be not
easy to prove wrong or
right, because they reflect what
the person thinks about something.
So in contrast, an objective statement can
usually be proved wrong or correct.
For example, you might say this
computer has a screen and a battery.
Now that's something you can check.
It either has a battery or not.
But in contrast with this, think about
the sentence such as, this laptop has
the best battery or
this laptop has a nice screen.
Now these statements
are more subjective and
it's very hard to prove
whether it's wrong or correct.
So opinion, is a subjective statement.
And next lets look at
the keyword person here.
And that indicates that
there is an opinion holder.
Because when we talk about opinion,
it's about an opinion held by someone.
And then we notice that
there is something here.
So that is the target of the opinion.
The opinion is expressed
on this something.
And now, of course, believes or
thinks implies that
an opinion will depend on the culture or
background and the context in general.
Because a person might think
different in a different context.
People from different background
may also think in different ways.
So this analysis shows that there are
multiple elements that we need to include
in order to characterize opinion.
So, what's a basic opinion
representation like?
Well, it should include at
least three elements, right?
Firstly, it has to specify
what's the opinion holder.
So whose opinion is this?
Second, it must also specify the target,
what's this opinion about?
And third, of course,
we want the opinion content.
So what exactly is the opinion?
If you can identify these,
we get a basic understanding of the opinion,
which can already be useful sometimes.
If we want to understand further,
we want an enriched opinion representation.
And that means we also want to
understand, for example,
the context of the opinion and
in what situation the opinion was expressed.
For example, what time was it expressed?
We also would like to understand
the opinion sentiment, and this is
to understand what the opinion tells
us about the opinion holder's feeling.
For example, is this opinion positive,
or negative?
Or perhaps the opinion holder was happy or
was sad, and
so such understanding obviously
goes beyond just extracting
the opinion content;
it needs some analysis.
So let's take a simple
example of a product review.
In this case, there is an explicit
opinion holder and an explicit target.
So it's obvious what the opinion holder is,
and that's just the reviewer, and it's also often
very clear what the opinion target is,
and that's the product being reviewed,
for example iPhone 6.
When the review is posted, usually
you can extract such information easily.
Now the content, of course,
is a review text that's, in general,
also easy to obtain.
So you can see product reviews are fairly
easy to analyze in terms of obtaining
a basic opinion representation.
But of course, if you want to get more
information, you might want to know the context,
for example.
The review was written in 2015.
Or, we want to know that the sentiment
of this review is positive.
So, this additional understanding of
course adds value to mining the opinions.
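The basic and enriched opinion representations can be sketched as a small data structure. This is only an illustration: the class name `Opinion`, its field names, and the placeholder review text are assumptions, while the holder, target, year, and sentiment mirror the product review example above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Opinion:
    # Basic representation: who holds the opinion, what it is about, and what it says.
    holder: str
    target: str
    content: str
    # Enriched representation: optional context and sentiment.
    context: Optional[str] = None
    sentiment: Optional[str] = None

# The review text itself is a placeholder, not taken from the lecture.
review = Opinion(
    holder="reviewer",
    target="iPhone 6",
    content="<review text>",
    context="written in 2015",
    sentiment="positive",
)
print(review)
```

Leaving `context` and `sentiment` as optional fields reflects that a basic representation is already useful before the enriched elements are inferred.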
Now, you can see in this case the task
is relatively easy and that's
because the opinion holder and the opinion
target have already been identified.
Now let's take a look at
a sentence in the news.
In this case, we have an implicit
holder and an implicit target.
And the task is in general harder.
So, we can identify opinion holder here,
and that's the governor of Connecticut.
We can also identify the target.
So one target is Hurricane Sandy, but
there is also another target
mentioned which is hurricane of 1938.
So what's the opinion?
Well, there's a negative sentiment here
that's indicated by words like bad and
worst.
And we can also, then, identify context,
New England in this case.
Now, unlike in the product review,
all these elements must be extracted by
using natural language processing techniques.
So, the task is much harder,
and we need deeper natural
language processing.
And these examples also
suggest that a lot of work can be
easily done for product reviews.
That's indeed what has happened.
Analyzing opinions
in the news is still quite difficult;
it's more difficult than the analysis
of opinions in product reviews.
Now there are also some other
interesting variations.
In fact, here we're going to
examine the variations of opinions,
more systematically.
First, let's think about
the opinion holder.
The holder could be an individual or
it could be a group of people.
Sometimes, the opinion
is from a committee
or from a whole country of people.
The opinion target can also vary a lot.
It can be about one entity,
a particular person, a particular product,
a particular policy, etc.
But it could be about a group of products.
Could be about the products
from a company in general.
Could also be very specific,
about one attribute
of the entity.
For example,
it's just about the battery of the iPhone.
It could be someone else's opinion,
and one person might comment on
another person's opinion, etc.
So, you can see there is a lot of
variation here that will cause
the problem to vary a lot.
Now, opinion content, of course,
can also vary a lot on the surface,
you can identify one-sentence opinion or
one-phrase opinion.
But you can also have longer
text to express an opinion,
like the whole article.
And furthermore we identify
the variation in the sentiment or
emotion dimension, which is about
the feeling of the opinion holder.
So, we can distinguish positive
versus negative or neutral, or
happy versus sad, etc.
Finally, the opinion
context can also vary.
We can have a simple context, like
different time or different locations.
But there could be also complex contexts,
such as some background
of topic being discussed.
So when an opinion is expressed in a
particular discourse context, it has to
be interpreted in different ways than
when it's expressed in another context.
So the context can be very [INAUDIBLE] to
entire discourse context of the opinion.
From a computational perspective,
we're mostly interested in what opinions
can be extracted from text data.
So, it turns out that we can
also distinguish
different kinds of opinions in text
data from a computational perspective.
First, the observer might make
a comment about an opinion target
in the observed world. So
in this case we have the author's opinion.
For example,
I don't like this phone at all.
And that's an opinion of this author.
In contrast, the text might also
report opinions about others.
So the person could also Make observation
about another person's opinion and
reported this opinion.
So for example,
I believe he loves the painting.
And that opinion is
really expressed by another person here.
So, it doesn't mean this
author loves that painting.
So clearly, the two kinds of opinions
need to be analyzed in different ways.
And sometimes in product reviews,
you can see that, although mostly the opinions
are from this reviewer,
sometimes a reviewer might mention
opinions of his friend or her friend.
Another complication is that
there may be indirect opinions or
inferred opinions that can be obtained
by making inferences on
what's expressed in the text that might
not necessarily look like opinion.
For example, one statement might be,
this phone ran out of
battery in just one hour.
Now, this is in a way a factual statement
because it's either true or false, right?
You can even verify that,
but from this statement,
one can also infer some negative opinions
about the quality of the battery of
this phone, or the feeling of
the opinion holder about the battery.
The opinion holder clearly wishes
that the battery lasted longer.
So these are interesting variations
that we need to pay attention to when we
extract opinions.
Also, for
this reason about indirect opinions,
it's often also very useful to extract
whatever the person has said about
the product, and sometimes factual
sentences like these are also very useful.
So, from a practical viewpoint,
sometimes we don't necessarily
extract only the subjective sentences.
Instead, again, all the sentences that
are about the opinions are useful for
understanding the person or
understanding the product being commented on.
So the task of opinion mining can be
defined as taking text data as input
to generate a set of
opinion representations.
In each representation, we should
identify the opinion holder,
target, content, and context.
Ideally we can also infer opinion
sentiment from the content and
the context to better understand
the opinion.
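The opinion representation just described can be sketched as a small record type. This is only an illustration: the class name `Opinion`, its field names, and the example values are assumptions made for this sketch, not something defined in the lecture.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Opinion:
    """One opinion representation (hypothetical field names)."""
    holder: str                      # who expresses the opinion
    target: str                      # what the opinion is about
    content: str                     # the text that expresses the opinion
    context: Optional[str] = None    # e.g. time, location, or discourse context
    sentiment: Optional[str] = None  # inferred label, when it can be inferred


# An indirect opinion: a factual sentence from which a negative
# sentiment about the battery can be inferred.
review = Opinion(
    holder="reviewer",
    target="phone battery",
    content="This phone ran out of battery in just one hour.",
    context="product review",
    sentiment="negative",
)
print(review)
```

In a product review, the holder and target fields are usually known in advance, so only the content and sentiment need to be extracted.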
Now often, some elements of
the representation are already known.
I just gave a good example in
the case of product reviews,
where the opinion holder and the opinion
target are often explicitly identified.
And that's why this turns out to be
one of the simplest opinion mining tasks.
Now, it's interesting to think about
the other tasks that might be also simple.
Because those are the cases
where you can easily build
applications by using
opinion mining techniques.
So now that we have talked about what is
opinion mining, we have defined the task.
Let's also just talk a little bit about
why opinion mining is very important and
why it's very useful.
So here, I identify three major reasons,
three broad reasons.
The first is it can help decision support.
It can help us optimize our decisions.
We often look at other people's opinions,
read the reviews
in order to make decisions like
buying a product or using a service.
We also would be interested
in others' opinions
when we decide whom to vote for, for example.
And policy makers,
may also want to know people's
opinions when designing a new policy.
So that's one general
kind of application.
And it's very broad, of course.
The second application is to understand
people, and this is also very important.
For example, it could help
understand people's preferences.
And this could help us
better serve people.
For example, we can optimize a product search
engine or optimize a recommender system
if we know what people are interested in,
what people think about products.
It can also help with advertising,
of course, and we can have targeted
advertising if we know what kind of
people tend to like what kind of product.
Now the third kind of application
can be called voluntary survey.
Now this is mostly important for research
that used to be done by doing surveys,
doing manual surveys
with questions and answers.
People need to fill in forms
to answer the questions.
Now this is directly related to humans
as sensors, and we can usually aggregate
opinions from a lot of humans to
kind of assess the general opinion.
Now this would be very useful for
business intelligence where manufacturers
want to know where their products
have advantages over others.
What are the winning
features of their products,
and the winning features of competing products?
Market research has to do with
understanding consumers' opinions.
And this is clearly very useful for
that.
Data-driven social science research
can benefit from this because they can
do text mining to understand
the people's opinions.
And if you can aggregate a lot of opinions
from social media, from a large
population, then you can actually
do some study of some questions.
For example, we can study the behavior of
people on social media on social networks.
And these can be regarded as voluntary
survey done by those people.
In general, we can gain a lot of advantage
in any prediction task because we can
leverage the text data as
extra data about any problem.
And so we can use text based
prediction techniques to help you
make predictions or
improve the accuracy of prediction.
[MUSIC]

[SOUND]
This lecture is about using a time series
as context to potentially
discover causal topics in text.
In this lecture, we're going to continue
discussing Contextual Text Mining.
In particular, we're going to look
at the time series as a context for
analyzing text,
to potentially discover causal topics.
As usual, let's start with the motivation.
In this case, we hope to use text
mining to understand a time series.
Here, what you are seeing is Dow Jones
Industrial Average stock price curves.
And you'll see a sudden drop here.
Right.
So one would be interested in knowing
what might have caused the stock
market to crash.
Well, if you know the background,
you might be able to figure it out if you
look at the time stamp, or there may be other
data that can help us think about it.
But the question here is can
we get some clues about this
from the companion news stream?
And we have a lot of news data
that was generated during that period.
So if we do that, we might
actually discover that the crash
happened
at the time of the September 11 attack.
And that's the time when there
is a sudden rise of the topic
about September 11
in news articles.
Here's another scenario where we want
to analyze the Presidential Election.
And this is a time series from
the Presidential Prediction Market.
For example, the Iowa Electronic Market
would have stocks for each candidate.
And if you believe one candidate will
win, then you tend to buy the stock for
that candidate, causing the price
of that candidate to increase.
So, that's a nice way to actually do
a survey of people's opinions about
these candidates.
Now, suppose you see a sudden
drop of price for one candidate.
You might also want to know what
might have caused the sudden drop.
Or in a social science study, you might
be interested in knowing what mattered
in this election,
what issues really mattered to people.
Now again in this case,
we can look at the companion news
stream and ask the question:
are there any clues in the news stream
that might provide insight about this?
So for example,
we might discover that the mention of tax cut
has been increasing since that point.
So maybe,
that's related to the drop of the price.
So all these cases are special
cases of a general problem of joint
analysis of text and a time series
data to discover causal topics.
The input in this case is time series plus
text data that are produced in the same
time period, the companion text stream.
And this is different from
the standard topic models,
where we have just a text collection.
That's why we see time series here,
it serves as context.
Now, the output that we
want to generate is the topics
whose coverage in the text stream has
strong correlations with the time series.
For example, whenever the topic is
mentioned, the price tends to go down, etc.
Now we call these topics Causal Topics.
Of course, they're not,
strictly speaking, causal topics.
We are never going to be able to
verify whether they are causal, or
there's a true causal relationship here.
That's why we put causal
in quotation marks.
But at least they are correlated
topics that might potentially
explain the cause, and
humans can certainly further analyze such
topics to understand the issue better.
And the output would contain topics
just like in topic modeling.
But we hope that these topics are not
just the regular topics.
These topics certainly don't have to
explain the text data the best, but
they still have to explain
the data in the text,
meaning that they have to represent
semantically meaningful topics in the text.
But also, more importantly,
they should be correlated with the external
time series that's given as a context.
So to understand how we solve this
problem, let's first just try to
solve the problem with a regular
topic model, for example PLSA.
And we can apply this to the text stream,
with some extension like CPLSA, or
Contextual PLSA.
Then we can discover these
topics in the collection and
also discover their coverage over time.
So, one simple solution is,
to choose the topics from
this set that have the strongest
correlation with the external time series.
But this approach is not
going to be very good.
Why?
Because
we are restricted to the topics
that are discovered by PLSA or LDA.
And that means the choice of
topics will be very limited.
And we know these models try to maximize
the likelihood of the text data.
So those topics tend to be the major
topics that explain the text data well.
And they are not necessarily
correlated with time series.
Even if we get the best one, the most
correlated topics might still not be so
interesting from causal perspective.
So in the work cited here,
a better approach was proposed.
And this approach is called
Iterative Causal Topic Modeling.
The idea is to do an iterative
adjustment of the topics
discovered by topic models, using
the time series to induce a prior.
So here's an illustration of
how this works.
Take the text stream as input and
then apply regular topic modeling
to generate a number of topics.
Let's say four topics.
Shown here.
And then we're going to use
external time series to assess
which topic is more causally related or
correlated with the external time series.
So we have something to rank them.
And we might think that topic one and
topic four are more correlated and
topic two and topic three are not.
Now we could have stopped here, and
that would be just like the simple
approach that I talked about earlier.
Then we can get these topics and
call them causal topics.
But as I also explained, these
topics are unlikely to be very good
because they are general topics that
explain the whole text collection.
They are not necessarily
the topics best correlated
with our time series.
So what we can do in this approach
is to first zoom into the word level.
We can look into each word in
the top ranked word list for each topic.
Let's say we take Topic 1
as the target to examine.
We know Topic 1 is correlated
with the time series.
Or is at least the best that we could
get from this set of topics so far.
And we're going to look at the words
in this topic, the top words.
And if the topic is correlated
with the Time Series,
there must be some words that are highly
correlated with the Time Series.
So here, for example,
we might discover W1 and W3 are positively
correlated with Time Series, but
W2 and W4 are negatively correlated.
So, as a topic, it's not good to mix
these words with different correlations.
So we can then further
separate these words.
We are going to get all the red words
that indicate positive correlations,
W1 and W3.
And
we're going to also get another subtopic,
if you want,
that represents the negatively
correlated words, W2 and W4.
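The word-level split just described can be sketched in a few lines. This is a minimal illustration under stated assumptions: it uses plain Pearson correlation with no time lag, a hypothetical threshold of 0.5, and made-up word counts per time step. The actual approach may use a different causality or correlation measure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / (sx * sy)


def split_by_correlation(word_series, external, threshold=0.5):
    """Split a topic's top words into positively and negatively
    correlated subtopics; weakly correlated words are dropped."""
    positive, negative = [], []
    for word, counts in word_series.items():
        r = pearson(counts, external)
        if r >= threshold:
            positive.append(word)
        elif r <= -threshold:
            negative.append(word)
    return positive, negative


# Made-up counts of four top words of "Topic 1" over five time steps.
word_series = {
    "w1": [1, 2, 3, 4, 5],
    "w2": [5, 4, 3, 2, 1],
    "w3": [2, 4, 6, 8, 10],
    "w4": [9, 7, 6, 3, 1],
}
stock = [10, 11, 12, 13, 14]  # made-up external time series

pos, neg = split_by_correlation(word_series, stock)
print(pos, neg)  # ['w1', 'w3'] ['w2', 'w4']
```

Each of the two word lists would then serve as a prior for the next round of topic modeling.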
Now, these subtopics, or these variations
of topics, based on the correlation
analysis, are topics that are still quite
related to the original topic, Topic 1.
But they are already deviating,
because of the use of time series
information to bias the selection of words.
So in some sense,
we should expect them to be
more correlated with the time
series than the original Topic 1.
Because Topic 1 has mixed words,
and here we separate them.
So each of these two subtopics
can be expected to be better
correlated with the time series.
However, they may not be so
coherent semantically.
So the idea here is to go back
to the topic model by using
each of these as a prior to further
guide the topic modeling.
That's to say, we ask our topic
models to now discover topics that
are very similar to each
of these two subtopics.
And this will cause a bias toward topics
more correlated with the time series.
Of course, then we can apply topic models
to get another generation of topics.
And those can be further ranked based on
the time series to select the highly
correlated topics.
And then we can further analyze
the component words in the topic and
then try to analyze word-level
correlation.
And then get even more
correlated subtopics that can be
further fed into the process as a prior
to drive the topic model discovery.
So this whole process is just a heuristic
way of optimizing causality and
coherence, and that's our ultimate goal.
Right?
So here you see the pure topic
models will be very good at
maximizing topic coherence,
the topics will be all meaningful.
If we only use a causality test,
or correlation measure,
then we might get a set of words that
are strongly correlated with the time series,
but they may not
necessarily mean anything.
They might not be semantically connected.
So, that would be at the other extreme,
on the top.
Now, the ideal is to get the causal
topic that's scored high,
both in topic coherence and
also causal relation.
And this approach
can be regarded as an alternating
way to maximize both dimensions.
So when we apply the topic models
we're maximizing the coherence.
But when we decompose the topic
model words into sets
of words that are very strongly
correlated with the time series,
and select the words most strongly
correlated with the time series,
we are pushing the model
back to the causal
dimension to make it
better in causal scoring.
And then, when we apply
the selected words as a prior
to guide a topic modeling, we again
go back to optimize the coherence.
Because topic models ensure the next
generation of topics to be coherent,
and we can iterate to optimize
in this way, as shown in this picture.
So I think the only component that you
haven't seen in such a framework is how
to measure the causality,
because the rest is just topic modeling.
So let's have a little bit
of discussion of that.
So here we show that.
And let's say we have a topic
about government response here.
And then with topic modeling we can
get the coverage of the topic over time.
So, we have a time series, X sub t.
Now, we are also given a time series
that represents external information.
It's a non-text time series, Y sub t.
It's the stock prices.
Now the question
here is, does Xt cause Yt?
Well, in other words, we want to measure
the causality relation between the two.
Or maybe just measure
the correlation of the two.
There are many measures that
we can use in this framework.
For example, Pearson correlation
is a commonly used measure.
And we've got to consider time lag here so
that we can try to
capture causal relation,
using somewhat past data,
using the data in the past
to try to correlate with the data
points of y that represent the future,
for example.
And by introducing such lag, we can
hopefully capture some causal relation
even by using correlation measures
like Pearson correlation.
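A lagged correlation of this kind can be sketched as follows. The function name `lagged_pearson` and the toy series are assumptions for illustration; the idea is simply to pair each past value x[t] with the later value y[t + lag] before computing Pearson correlation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def lagged_pearson(x, y, lag):
    """Correlate x[t] with y[t + lag]; x and y must have equal length."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if lag < 0 or lag >= len(x) - 1:
        raise ValueError("lag must satisfy 0 <= lag < len(x) - 1")
    return pearson(x[:len(x) - lag], y[lag:])


# Toy data: y repeats x two steps later.
x = [0, 1, 0, 0, 2, 0, 0, 3, 0, 0]
y = [0, 0, 0, 1, 0, 0, 2, 0, 0, 3]

r_lag0 = lagged_pearson(x, y, 0)  # no lag: the spikes never line up
r_lag2 = lagged_pearson(x, y, 2)  # lag 2: the spikes line up exactly
print(r_lag0, r_lag2)
```

With no lag the two series look unrelated (the correlation is negative here), while at lag 2 the correlation is essentially 1, which is the kind of pattern the lag is meant to expose.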
But a commonly used measure for
causality here is the Granger Causality Test.
And the idea of this test
is actually quite simple.
Basically you're going to have
an autoregressive model to
use the history information
of Y to predict itself.
And this is the best we could do
without any other information.
So we're going to build such a model.
And then we're going to add some history
information of X into such model.
To see if we can improve
the prediction of Y.
If we can do that with a statistically
significant difference,
then we just say X has some
causal influence on Y,
because otherwise it wouldn't have caused
improvement in the prediction of Y.
If, on the other hand,
the difference is insignificant,
that would mean X does not really
have a causal relation with Y.
So that's the basic idea.
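The comparison of the two models can be sketched as below. This is a simplified illustration, not a full implementation of the test: it assumes a single lag, fits both regressions by ordinary least squares through the normal equations, and returns only the F statistic (a real test would compare it against an F distribution to get a p-value). All function names and the synthetic data are assumptions made for this sketch.

```python
import random


def solve_linear(a, b):
    """Solve a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - s) / m[r][r]
    return coef


def residual_sum_of_squares(rows, targets):
    """Fit least squares via the normal equations and return the RSS."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    coef = solve_linear(xtx, xty)
    return sum((t - sum(c * v for c, v in zip(coef, r))) ** 2
               for r, t in zip(rows, targets))


def granger_f_statistic(x, y):
    """F statistic for: does x[t-1] improve the prediction of y[t]
    beyond what y[t-1] alone achieves? (one lag only)"""
    targets = y[1:]
    restricted = [[1.0, y[t - 1]] for t in range(1, len(y))]
    unrestricted = [[1.0, y[t - 1], x[t - 1]] for t in range(1, len(y))]
    rss_r = residual_sum_of_squares(restricted, targets)
    rss_u = residual_sum_of_squares(unrestricted, targets)
    dof = len(targets) - 3  # observations minus unrestricted parameters
    return (rss_r - rss_u) / (rss_u / dof)


# Synthetic data in which y is driven by the previous value of x.
random.seed(0)
x = [random.gauss(0, 1) for _ in range(200)]
y = [0.0] + [0.9 * x[t - 1] + random.gauss(0, 0.1) for t in range(1, 200)]

f_xy = granger_f_statistic(x, y)  # x helping to predict y
f_yx = granger_f_statistic(y, x)  # y helping to predict x
print(f_xy, f_yx)
```

On this synthetic data the statistic for "x helps predict y" should be far larger than the one for the reverse direction, since the history of y carries no information about future x.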
Now, we don't have time to explain
this in detail, but
you could read the cited reference
here to know more about this measure.
It's a very commonly used measure.
It has many applications.
So next, let's look at some sample
results generated by this approach.
And here the data is
the New York Times
in the time period of June
2000 through December of 2011.
And here the time series we used
are the stock prices of two companies,
American Airlines and Apple, and
the goal is to see if we inject
some time series context,
whether we can actually get topics
that are biased toward the time series.
Imagine if we don't use any input,
we don't use any context.
Then the topics from the New York
Times discovered by PLSA would be
just general topics that
people talk about in news.
All right.
Those are major topics in the news.
But here you see these topics are indeed
biased toward each time series.
And particularly if you look
at the underlined words here
in the American Airlines result,
and you see airlines,
airport, air, united trade,
or terrorism, etc.
So it clearly has topics that are more
correlated with the external time series.
On the right side,
you see that some of the topics
are clearly related to Apple, right.
So you can see computer, technology,
software, internet, com, web, etc.
So that just means the time series
has effectively served as a context
to bias the discovery of topics.
From another perspective,
these results help us understand what people
have talked about in each case.
So not just
what people have talked about,
but what are some topics that might be
correlated with their stock prices.
And so these topics can serve
as a starting point for
people to further look into issues and
find the true causal relations.
Here are some other results from analyzing
Presidential Election time series.
The time series data here is
from the Iowa Electronic Market.
And that's a prediction market.
And the text data is the same,
the New York Times, from May
2000 to October 2000.
That's for the
2000 presidential election campaign.
Now, what you see here
are the top three words in significant
topics from New York Times.
And if you look at these topics, they
are indeed quite related to the campaign.
Actually the issues
are very much related to
the important issues of
this presidential election.
Now here I should mention that the text
data has been filtered by using
only the articles that mention
these candidate names.
It's a subset of these news articles.
Very different from
the previous experiment.
But the results here clearly show
that the approach can uncover some
important issues in that
presidential election.
So tax cut, oil energy, abortion and
gun control are all known
to be important issues in
that presidential election.
And that was supported by some
literature in political science.
And they were also discussed in Wikipedia,
right.
So basically the results show
that the approach can effectively
discover possibly causal topics
based on the time series data.
So there are two suggested readings here.
One is the paper about this iterative
topic modeling with time series feedback.
Where you can find more details
about how this approach works.
And the second one is a reading
about the Granger Causality Test.
So in the end, let's summarize
the discussion of Text-based Prediction.
Now, Text-based prediction
is generally very useful for
big data applications that involve text.
Because it can help us infer
new knowledge about the world.
And the knowledge can go beyond
what's discussed in the text.
As a result, it can also support
optimizing our decision making.
And this has widespread applications.
Text data is often combined with
non-text data for prediction.
Because, for this purpose,
the prediction purpose,
we generally would like to combine
non-text data and text data together,
as many clues as possible for prediction.
And so as a result, joint
analysis of text and
non-text data is very necessary and
it's also very useful.
Now when we analyze text data
together with non-text data,
we can see they can help each other.
So non-text data, provide a context for
mining text data, and
we discussed a number of techniques for
contextual text mining.
And on the other hand,
text data can also help interpret
patterns discovered from non-text data,
and this is called pattern annotation.
In general,
this is a very active research topic, and
there are new papers being published.
And there are also many open
challenges that have to be solved.
[MUSIC]

This lecture is a summary
of this whole course.
First, let's revisit the topics
that we covered in this course.
In the beginning, we talked about
the natural language processing and
how it can enrich text representation.
We then talked about how to mine
knowledge about the language,
the natural language used to express
what's observed in the world in text
data.
In particular, we talked about
how to mine word associations.
We then talked about how
to analyze topics in text.
How to discover topics and analyze them.
This can be regarded as
knowledge about observed world,
and then we talked about how to mine
knowledge about the observer, and
particularly talked about how to
mine opinions and do sentiment analysis.
And finally, we talked about
text-based prediction, which has to
do with predicting values of other real
world variables based on text data.
And in discussing this, we also
discussed the role of non-text data,
which can contribute additional
predictors for the prediction problem,
and also can provide context for
analyzing text data, and
in particular we talked about how
to use context to analyze topics.
So here are the key high-level
take-away messages from this course.
I'm going to go over these major topics and
point out what are the key take-away
messages that you should remember.
First, NLP and text representation.
You should realize that NLP
is always very important for
any text application because it
enriches text representation.
The more NLP the better text
representation we can have.
And this further enables more
accurate knowledge discovery,
to discover deeper knowledge,
buried in text.
However, the current state of the art
of natural language processing is
still not robust enough.
So, as a result,
the robust text mining technologies today
tend to be based on word [INAUDIBLE].
And tend to rely a lot
on statistical analysis,
as we've discussed in this course.
And you may recall we've mostly
used word based representations.
And we've relied a lot on
statistical techniques,
statistical learning
techniques particularly.
In word association mining and
analysis, the important points are, first,
we introduced the two concepts for
two basic and
complementary relations of words,
paradigmatic and syntagmatic relations.
These are actually very general
relations between elements in sequences,
if you take them as meaning
elements that occur in similar
contexts in the sequence, and elements
that tend to co-occur with each other.
And these relations might be also
meaningful for other sequences of data.
We also talked a lot about
text similarity when we
discussed how to discover
paradigmatic relations: we compare
the contexts of words to discover
words that share similar contexts.
At that point,
we talked about representing text
data with a vector space model.
And we talked about some retrieval
techniques such as BM25 for
measuring similarity of text and
for assigning weights to terms,
tf-idf weighting, et cetera.
And this part is well-connected
to text retrieval.
There are other techniques that
can be relevant here also.
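As a reminder of how such similarity can be computed, here is a minimal sketch of TF-IDF weighting in a vector space model with cosine similarity. It is an illustration only: the IDF formula log((N + 1) / df) is one common variant chosen here as an assumption, the toy documents are made up, and this is plain TF-IDF rather than BM25.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (dicts)."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((n + 1) / count) for term, count in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)


# Made-up documents; they could equally be the context bags of two words.
docs = [
    ["battery", "life", "is", "short"],
    ["battery", "drains", "fast"],
    ["screen", "is", "bright"],
]
vecs = tfidf_vectors(docs)

print(cosine(vecs[0], vecs[1]))  # share "battery", so similarity is positive
print(cosine(vecs[1], vecs[2]))  # no shared terms, so similarity is 0.0
```

The same function applies to paradigmatic relation discovery if each vector is built from the context words around a target word instead of from a document.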
The next point is about
co-occurrence analysis of text, and
we introduced some information
theory concepts such as entropy,
conditional entropy,
and mutual information.
These are not only very useful for
measuring the co-occurrences of words,
they are also very useful for
analyzing other kinds of data, and
they are useful, for example, for
feature selection in text
categorization as well.
So this is another important concept,
good to know.
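These three quantities can be sketched for the presence or absence of words in text segments. The example assumes each segment is a set of words and estimates probabilities by simple counting without smoothing; the function names and toy segments are assumptions for illustration.

```python
import math


def entropy(p):
    """Entropy in bits of a binary variable that is 1 with probability p."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)


def mutual_information(segments, w1, w2):
    """I(X; Y) in bits, where X and Y indicate presence of w1 and w2."""
    n = len(segments)
    mi = 0.0
    for a in (True, False):
        for b in (True, False):
            p_ab = sum(1 for s in segments
                       if (w1 in s) == a and (w2 in s) == b) / n
            p_a = sum(1 for s in segments if (w1 in s) == a) / n
            p_b = sum(1 for s in segments if (w2 in s) == b) / n
            if p_ab > 0:
                mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


def conditional_entropy(segments, w1, w2):
    """H(X | Y) = H(X) - I(X; Y) for the presence of w1 given w2."""
    p = sum(1 for s in segments if w1 in s) / len(segments)
    return entropy(p) - mutual_information(segments, w1, w2)


# Made-up segments: "eat" and "meat" always occur together.
segments = [
    {"eat", "meat"},
    {"eat", "meat"},
    {"the", "cat"},
    {"the", "dog"},
]

mi_meat = mutual_information(segments, "eat", "meat")
mi_cat = mutual_information(segments, "eat", "cat")
print(mi_meat, mi_cat)
print(conditional_entropy(segments, "eat", "meat"))
```

Here knowing whether "meat" occurs removes all uncertainty about "eat", so the conditional entropy is 0 and the mutual information equals the entropy of "eat", while "cat" is much less informative.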
And then we talked about
topic mining and analysis, and
that's where we introduced
the probabilistic topic model.
We spent a lot of time
explaining the basic topic model,
PLSA, in detail, and this is the
basis for understanding LDA, which is
theoretically a more appealing model, but
we did not have enough time to really
go in depth in introducing LDA.
But in practice,
PLSA seems as effective as LDA and
it's simpler to implement and
it's also more efficient.
In this part, we also introduced some
general concepts that would be useful to
know. One is the generative model,
and this is a general method for
modeling text data and
modeling other kinds of data as well.
And we talked about the maximum likelihood
estimator and the EM algorithm for
solving the problem of
computing the maximum likelihood estimate.
So, these are all general techniques
that tend to be very useful
in other scenarios as well.
Then we talked about the text
clustering and the text categorization.
Those are two important building blocks
in any text mining application systems.
In text clustering, we talked
about how we can solve the problem by
using a slightly different mixture model
than the probabilistic topic model,
and we then also briefly
reviewed the similarity-based
approaches to text clustering.
In categorization we also talked
about the two kinds of approaches.
One is generative classifiers
that rely on Bayes' rule to
infer the conditional
probability of a category given text data;
in particular, we introduced
[INAUDIBLE] Bayes in detail.
This is a practically useful technique
for a lot of text categorization tasks.
We also introduced some
discriminative classifiers,
particularly logistic regression,
k-nearest neighbors, and SVM.
They are also very important, they are very
popular, and they are very useful for
text categorization as well.
In both parts, we also discussed
how to evaluate the results.
Evaluation is quite important because if
the measures that you use don't really
reflect the utility of the method, then
they would give you misleading results, so
it's very important to
get the evaluation right.
And we talked about evaluation of
categorization in detail, with a lot of
specific measures.
Then we talked about sentiment
analysis and opinion mining, and
that's where we introduced the
sentiment classification problem.
And although it's a special
case of text categorization,
we talked about how to extend or
improve the text categorization method
by using more sophisticated features that
would be needed for sentiment analysis.
We did a review of some commonly used
complex features for text analysis, and
then we also talked about how to
capture the order of these categories
in sentiment classification, and
in particular we introduced ordinal
logistic regression. Then we also talked
about Latent Aspect Rating Analysis.
This is an unsupervised way of using
a generative model to understand
review data in more detail.
In particular, it allows us to
understand the decomposed ratings of
a reviewer on different
aspects of a topic.
So given text reviews
with overall ratings,
the method allows us to infer
ratings on different aspects.
And it also allows us to infer
the reviewer's latent
weights on these aspects, so
which aspects are more important to
a reviewer can be revealed as well.
And this enables a lot of
interesting applications.
Finally, in the discussion of prediction,
we mainly talk about the joint mining
of text and non text data, as they
are both very important for prediction.
We particularly talked about how text data
can help non-text data and vice versa.
In the case of using non-text
data to help text data analysis,
we talked about
the contextual text mining.
We introduced the contextual PLSA as a
generalized model of PLSA
to allow us to incorporate context
variables, such as time and location.
And this is a general way to allow us
to reveal a lot of interesting topical
patterns in text data.
We also introduced NetPLSA;
in this case we used a social network, or
a network in general, of text
data to help analyze topics.
And finally we talked about how a time series
can be used as context to
mine potentially causal
topics in text data.
Now, in the other way, of using text to
help interpret patterns
discovered from non-text data,
we did not really discuss anything in
detail but just provided a reference, but
I should stress that that's actually a very
important direction to know about
if you want to build practical
text mining systems,
because understanding and
interpreting patterns is quite important.
So this is a summary of the key
take-away messages, and
I hope these will be very
useful to you for building any
text mining applications or
for further study of these algorithms.
And this should provide a good basis for
you to read frontier research papers,
to know more about
other algorithms, or
to invent new algorithms yourself.
So to know more about this topic,
I would suggest you to look
into other areas in more depth.
And during this short period
of time of this course,
we could only touch the basic concepts,
basic principles, of text mining and
we emphasize the coverage
of practical algorithms.
And this comes at the cost
of covering fewer algorithms, and
in many cases we omitted the discussion
of a lot of algorithms.
So to learn more about the subject
you should definitely learn more
about natural language processing
because this is the foundation for
all text-based applications.
The more NLP you can do, the better
the representation of text that you can get, and
then the deeper knowledge
you can discover.
So this is very important.
The second area you should look into
is the Statistical Machine Learning.
And these techniques are now
the backbone techniques for
not just text analysis applications but
also for NLP.
A lot of NLP techniques are nowadays
actually based on supervised machine learning.
So, they are very important
because they are a key
to also understanding some
advanced NLP techniques, and
naturally they will provide more tools for
doing text analysis in general.
Now, a particularly interesting area,
called deep learning has attracted
a lot of attention recently.
It has also shown promise
in many application areas,
especially in speech and vision, and
has been applied to text data as well.
So, for example, recently there has
been work on using deep learning to do
sentiment analysis to
achieve better accuracy.
So that's one example of [INAUDIBLE]
techniques that we weren't able to cover,
but that's also very important.
And the other area that has emerged
in statistical learning is the word
embedding technique, where they can
learn better representations of words.
And then these better representations will
allow you to compute similarity of words.
As you can see,
this provides directly a way to discover
the paradigmatic relations of words.
And results that people have got,
so far, are very impressive.
That's another promising technique
that we did not have time to touch,
but, of course,
whether these new techniques
would lead to practically useful techniques
that work much better than the current
technologies is still an open
question that has to be examined.
And no serious evaluation
has been done yet.
In, for example, examining
the practical value of word embedding,
other than word similarity and
basic evaluation.
But nevertheless,
these are advanced techniques
that surely will make impact
in text mining in the future.
So it's very important to
know more about these.
Statistical learning is also the key to
predictive modeling, which is very crucial
for many big data applications. We did
not talk about that predictive modeling
component, but this is mostly about
regression or categorization
techniques, and this is another reason
why statistical learning is important.
We also suggest that you learn more about
data mining, and that's simply because
general data mining algorithms can always
be applied to text data, which can be
regarded as a special
case of general data.
So there are many applications
of data mining techniques.
In particular, for example, pattern
discovery would be very useful to generate
interesting features for text analysis,
and the recent information network
mining techniques can also be used
to analyze text information networks.
So these are all good to know
in order to develop effective
text analysis techniques.
And finally, we also recommend that you
learn more about text retrieval,
information retrieval, or search engines.
This is especially important if you
are interested in building practical text
application systems.
And a search engine would
be an essential system
component in any text-based application.
And that's because text data
are created for humans to consume.
So humans are in the best position
to understand text data, and
it's important to have humans in the loop
in big text data applications. A search
engine can in particular help text
mining systems in two ways.
One is to effectively reduce
the data size from a large collection to
a small collection with the most
relevant text data that matter for
the particular application.
The other is to provide a way to
annotate and explain patterns,
and this has to do with
knowledge provenance.
Once we discover some knowledge,
we have to figure out whether or
not the discovery is really reliable.
So we need to go back to
the original text to verify that.
And that is why the search
engine is very important.
Moreover, some techniques
of information retrieval,
for example BM25 and the vector space model,
are also very useful for text data mining.
We only mentioned some of them,
but if you know more about
text retrieval you'll see that there
are many techniques that are useful for it.
Another technique that is useful
is the indexing technique that enables quick
response of a search engine to a user's
query, and such techniques can be
very useful for building efficient
text mining systems as well.
So, finally, I want to remind
you of this big picture for
harnessing big text data that I showed
you at the beginning of the semester.
So in general, to deal with
a big text application system,
we need two kinds of techniques,
text retrieval and text mining.
And text retrieval, as I explained,
is to help convert big text data into
a small amount of most relevant data for
a particular problem, and can also help
provide knowledge provenance and
help interpret patterns later.
Text mining has to do with further
analyzing the relevant data to discover
the actionable knowledge that can be
directly useful for decision making or
many other tasks.
So this course covers text mining.
And there's a companion course
called Text Retrieval and
Search Engines that covers text retrieval.
If you haven't taken that course,
it would be useful for you to take it,
especially if you are interested
in building a text application system.
And taking both courses will give you
a complete set of practical skills for
building such a system.
So in [INAUDIBLE]
I just would like to thank you for
taking this course.
I hope you have learned useful knowledge
and skills in text mining and [INAUDIBLE].
As you see from our discussions,
there are a lot of opportunities for
this kind of techniques, and
there are also a lot of open challenges.
So I hope you can use what you have
learned to build a lot of useful
applications that will benefit society, and
to also join
the research community to discover new
techniques for text mining and analytics.
Thank you.
[MUSIC]

[NOISE]
This lecture is about
sentiment classification.
If we assume that
most of the elements in the opinion
representation are already known,
then our only task may be just a sentiment
classification, as shown in this case.
So suppose we know who's the opinion
holder and what's the opinion target,
and also know the content and the context
of the opinion, then we mainly need to
decide the opinion
sentiment of the review.
So this is a case of just using sentiment
classification for understanding opinion.
Sentiment classification can be
defined more specifically as follows.
The input is an opinionated text object,
the output is typically a sentiment label,
or a sentiment tag, and
that can be designed in two ways.
One is polarity analysis, where we have
categories such as positive, negative,
or neutral.
The other is emotion
analysis that can go beyond
a polarity to characterize
the feeling of the opinion holder.
In the case of polarity analysis,
we sometimes
also have numerical ratings as you
often see in some reviews on the web.
Five might denote the most positive, and
one may be the most negative, for example.
In general, you have just discrete
categories to characterize the sentiment.
In emotion analysis, of course,
there are also different ways to
design the categories.
The six most frequently
used categories are happy,
sad, fearful, angry,
surprised, and disgusted.
So as you can see, the task is essentially
a classification task, or categorization
task, as we've seen before, so it's
a special case of text categorization.
This also means any text categorization
method can be used to do sentiment
classification.
Now of course if you just do that,
the accuracy may not be good
because sentiment classification
does require some improvement over
regular text categorization technique,
or simple text categorization technique.
In particular,
it needs two kinds of improvements.
One is to use more sophisticated features
that may be more appropriate for
sentiment tagging as I
will discuss in a moment.
The other is to consider
the order of these categories, and
especially in polarity analysis,
it's very clear there's an order here,
and so these categories
are not all that independent.
There's order among them, and so
it's useful to consider the order.
For example, we could use
ordinal regression to do that,
and that's something that
we'll talk more about later.
So now, let's talk about some features
that are often very useful for
text categorization and
text mining in general, but
some of them are especially also
needed for sentiment analysis.
So let's start from the simplest one,
which is character n-grams.
You can just have a sequence
of characters as a unit,
and they can be mixed with different n's,
different lengths.
All right, and
this is a very general way and
very robust way to
represent the text data.
And you could do that for
any language, pretty much.
And this is also robust to spelling
errors or recognition errors, right?
So if you misspell a word by one character,
this representation actually would
allow you to match this word when
it occurs in the text correctly.
Right, so the misspelled word and
the correct form can be matched because
they contain some common
n-grams of characters.
But of course such a representation
would not be as discriminating as words.
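The robustness to misspelling can be illustrated with a small sketch.
The function name and the choice of n = 2 and 3 are assumptions made
for this example; any mix of lengths could be used.

```python
def char_ngrams(text, n_values=(2, 3)):
    """Collect the set of character n-grams for each length in n_values."""
    grams = set()
    for n in n_values:
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams


correct = char_ngrams("sentiment")
misspelled = char_ngrams("sentimant")  # one wrong character

# N-grams that both spellings share, so the two forms can still be matched.
shared = correct & misspelled
```

Here the two spellings still share n-grams such as "sen" and "nt",
even though they differ as whole words.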
So next, we have word n-grams,
a sequence of words and again,
we can mix them with different n's.
Unigrams are actually often very
effective for a lot of text processing
tasks, and it's mostly because words
are well-designed features by humans for
communication, and so
they are often good enough for many tasks.
But they are not good, or not sufficient, for
sentiment analysis clearly.
For example, we might see a sentence like,
it's not good or
it's not as good as something else, right?
So in such a case, if you
just take "good",
that would suggest positive, and that's not
good, all right, so it's not accurate.
But if you take a bigram, "not good",
together, then it's more accurate.
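A minimal sketch of word n-gram extraction shows the difference.
Whitespace tokenization is a simplifying assumption here; a real
system would use a proper tokenizer.

```python
def word_ngrams(text, n):
    """Return the list of word n-grams, using whitespace tokenization."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "it's not good"
unigrams = word_ngrams(sentence, 1)  # includes the misleading feature "good"
bigrams = word_ngrams(sentence, 2)   # includes the more accurate "not good"
```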
So longer n-grams are generally more
discriminative, and they're more specific.
If you match it, it says a lot, and
it's accurate; it's unlikely to be
very ambiguous.
But it may cause overfitting because with
such very unique features the machine
learning program can easily pick up
such features from the training set and
rely on such unique features
to distinguish the categories.
And obviously, that kind of classifier
won't generalize well to future data, where
such discriminative features
will not necessarily occur.
So that's a problem of
overfitting that's not desirable.
We can also consider part-of-speech tag
n-grams if we can do part-of-speech
tagging, and, for example,
an adjective and a noun could form a pair.
We can also mix n-grams of words and
n-grams of part of speech tags.
For example, the word great might be
followed by a noun, and this could become
a feature, a hybrid feature, that could
be useful for sentiment analysis.
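Such hybrid features can be sketched as follows. The part-of-speech
tags below are assigned by hand for illustration (an assumption); in
practice they would come from a tagger. The feature naming scheme is
also hypothetical.

```python
def hybrid_bigrams(tagged_tokens):
    """Build word+tag and tag+tag bigram features from (word, tag) pairs."""
    features = []
    for (word, tag), (_next_word, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        features.append(f"{word.lower()}_{next_tag}")  # word followed by a tag
        features.append(f"{tag}_{next_tag}")           # tag followed by a tag
    return features


# Hand-tagged example (assumption): "A great location".
tagged = [("A", "DET"), ("great", "ADJ"), ("location", "NOUN")]
features = hybrid_bigrams(tagged)
```

The feature "great_NOUN" captures the word great followed by any noun,
and "ADJ_NOUN" is the adjective-noun pair mentioned above.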
So next we can also have word classes.
So these classes can be syntactic, like
part-of-speech tags, or could be semantic,
and they might represent concepts in
a thesaurus or ontology, like WordNet.
Or they can be recognized named
entities, like people or places, and
these categories can be used to enrich
the representation as additional features.
We can also learn word clusters
empirically; for example,
we've talked about mining
associations of words.
And so we can have clusters of
paradigmatically related words or
syntagmatically related words, and
these clusters can be features to
supplement the word-based representation.
Furthermore, we can also have
frequent patterns in text, and
these could be frequent word sets, where
the words that
form the pattern do not necessarily
occur together or next to each other.
But we'll also have collocations where
the words may occur more closely together,
and such
patterns provide more discriminative
features than words, obviously.
And they may also generalize better
than just regular n-grams because they
are frequent.
So you expect them to
occur also in test data.
So they have a lot of advantages, but
they might still face the problem
of overfitting as the features
become more complex.
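A minimal sketch of frequent word sets, restricted to pairs for
brevity, might look like this. The support threshold and the example
reviews are assumptions; a full pattern mining algorithm would handle
larger sets more efficiently.

```python
from collections import Counter
from itertools import combinations


def frequent_word_pairs(documents, min_support):
    """Count word pairs that co-occur in a document, in any position."""
    counts = Counter()
    for doc in documents:
        vocab = sorted(set(doc.lower().split()))
        counts.update(combinations(vocab, 2))
    return {pair: c for pair, c in counts.items() if c >= min_support}


# Made-up reviews for illustration.
reviews = [
    "battery life is great",
    "great phone but battery drains",
    "the battery is great",
]
pairs = frequent_word_pairs(reviews, min_support=3)
```

Note that "battery" and "great" form a frequent pair even though they
are not adjacent in every review.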
This is a problem in general, and the same
is true for parse tree-based features,
where you can use a parse tree to derive
features such as frequent subtrees or
paths, and
those are even more discriminating, but
they are also more likely
to cause overfitting.
And in general, pattern discovery
algorithms are very useful for
feature construction because they allow
us to search in a large space of possible
features that are more complex than
words that are sometimes useful.
So in general, natural language
processing is very important to
derive complex features, and
they can enrich text representation.
So for example,
this is a simple sentence that I showed
you a long time ago in another lecture.
So from these words we can only
derive simple word n-gram
representations or character n-grams.
But with NLP,
we can enrich the representation
with a lot of other information such
as part-of-speech tags, parse trees,
entities, or even speech acts.
Now with such enriching information,
of course, we can generate a lot
of other features, more complex features
like mixed n-grams of words and
part-of-speech tags, or
even a part of a parse tree.
So in general, feature design actually
affects categorization accuracy
significantly, and it's a very important
part of any machine learning application.
In general, I think it would be
most effective if you can combine
machine learning, error analysis, and
domain knowledge in designing features.
So first you want to
use domain knowledge,
your understanding of the problem,
to design seed features, and
you can also define a basic feature space
with a lot of possible features for
the machine learning program to work on,
and machine learning can be applied to select
the most effective features or
construct new features.
That's feature learning, and
these features can then be further
analyzed by humans through error analysis.
And you can look at
the categorization errors, and
then further analyze what features can
help you recover from those errors,
or what features cause overfitting and
cause those errors.
And so this can lead to
feature validation that will
revise the feature set,
and then you can iterate.
And we might consider using
a different feature space.
So NLP enriches text
representation as I just said, and
because it enriches the feature space,
it allows a much larger search space
of features, and there are also many,
many more features that can be
very useful for a lot of tasks.
But be careful not to use a lot
of complex features because
it can cause overfitting,
or otherwise you would
have to train carefully
not to let overfitting happen.
So a main challenge in designing features,
a common challenge, is to optimize
a trade-off between exhaustivity and
specificity, and this trade-off
turns out to be very difficult.
Now exhaustivity means we want
the features to actually have
high coverage of a lot of documents.
And so in that sense,
you want the features to be frequent.
Specificity requires the feature
to be discriminative, so
naturally, infrequent features
tend to be more discriminative.
So this really causes a trade-off between
frequent versus infrequent features.
And that's why feature
design is usually hard.
And that's probably the most important
part in any machine learning
problem, and in particular in our case,
for text categorization or
more specifically
sentiment classification.
[MUSIC]

[NOISE] This lecture is about the ordinal
logistic regression for
sentiment analysis.
So, this is our problem setup for a
typical sentiment classification problem,
or more specifically, rating prediction.
We have an opinionated text document d as
input, and we want to generate as output,
a rating in the range of 1 through k so
it's a discrete rating, and
this is a categorization problem.
We have k categories here.
Now we could use a regular text
categorization technique
to solve this problem.
But such a solution would not consider the
order and dependency of the categories.
Intuitively, the features that can
distinguish category 2 from 1,
or rather rating 2 from 1,
may be similar to
those that can distinguish k from k-1.
For example, positive words
generally suggest a higher rating.
When we train a categorization
model by treating these categories as
independent, we would not capture this.
So what's the solution?
Well, in general, we can use ordinal
classification, and there are many
different approaches.
And here we're going to
talk about one of them that is
called ordinal logistic regression.
Now, let's first think about how
we use logistic regression for
a binary sentiment
categorization problem.
So suppose we just wanted to distinguish
a positive from a negative and
that is just a two category
categorization problem.
So the predictors are represented as X and
these are the features.
And there are M features all together.
The feature value is a real number.
And this can be representation
of a text document.
And Y has two values,
binary response variable 0 or 1.
1 means X is positive,
0 means X is negative.
And then of course this is a standard
two category categorization problem.
We can apply logistic regression.
You may recall that in logistic
regression, the log
odds that Y is equal to one
is
assumed to be a linear function
of these features, as shown here.
So this would allow us to also write
the probability of Y equals one, given X
in this equation that you
are seeing on the bottom.
So that's a logistic function and
you can see it relates
this probability,
the probability that Y=1,
to the feature values.
And of course beta i's
are parameters here, so this is
just a direct application of logistic
regression for binary categorization.
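The relation between the linear function and the probability can be
sketched in a few lines. The function and argument names are
assumptions for this example; beta0 is the intercept and betas are
the M feature weights.

```python
import math


def logistic_probability(x, beta0, betas):
    """p(Y=1 | X) when the log odds are beta0 + sum_i beta_i * x_i."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters and feature values, for illustration only.
p = logistic_probability([1.0, 2.0], beta0=0.5, betas=[1.0, -0.5])
log_odds = math.log(p / (1.0 - p))  # recovers the linear function value
```

Taking the log odds of the resulting probability gives back the
linear combination of the features, which is the assumption stated
above.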
What if we have multiple categories,
multiple levels?
Well, we can use such a binary
logistic regression classifier
to solve this multilevel
rating prediction.
And the idea is we can introduce
multiple binary classifiers.
In each case we ask
the classifier to predict
whether the rating is j or above,
or the rating is lower than j.
So when Yj is equal to 1,
it means the rating is j or above.
When it's 0,
that means the rating is lower than j.
So basically if we want to predict
a rating in the range of 1 to k,
we first have one classifier to
distinguish k versus others.
And that's our Classifier 1.
And then we're going to have another
classifier to distinguish
at least k-1 from the rest.
That's Classifier 2.
And in the end, we need a Classifier
to distinguish between 2 and 1.
So altogether we'll have k-1 classifiers.
Now if we do that, of course, then
we can also solve this problem,
and the logistic regression program
will also be very straightforward,
as you have just seen
on the previous slide.
Only that here we have more parameters.
Because for each classifier,
we need a different set of parameters.
So now the logistic regression
classifiers are indexed by j,
which corresponds to a rating level.
And I have also used alpha sub
j to replace beta 0.
And this is to
make the notation more consistent
with what we will show in
ordinal logistic regression.
So here we now have basically k minus one
regular logistic regression classifiers.
Each has its own set of parameters.
So now with this approach,
we can now do ratings as follows.
After we have trained these k-1
logistic regression classifiers,
separately of course,
then we can take a new instance and
then invoke a classifier
sequentially to make the decision.
So first let's look at the classifier
that corresponds to the rating level k.
So this classifier will tell
us whether this object should
have a rating of k or above.
If the probability according to this
logistic regression classifier is
larger than 0.5,
we're going to say yes.
The rating is k.
Now, what if it's not as
large as 0.5?
Well, that means the rating's below k,
right?
So now,
we need to invoke the next classifier,
which tells us whether
it's k minus one or above.
It's at least k minus one.
And if the probability is
larger than 0.5,
then we'll say, well, then it's k-1.
What if it says no?
Well, that means the rating
would be even below k-1.
And so we're going to just keep
invoking these classifiers.
And here we hit the end when we need
to decide whether it's two or one.
So this would help us solve the problem.
Right?
So we can have a classifier that would
actually give us a prediction of a rating
in the range of 1 through k.
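The sequential procedure just described can be sketched as follows.
This is only an illustration: the two features (counts of positive
and negative words) and all parameter values are hypothetical, and
each classifier has its own alpha and its own betas.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_rating_cascade(x, classifiers):
    """Invoke classifiers from level k down to level 2.

    classifiers: list of (level, alpha, betas), sorted from k down to 2.
    Returns the first level whose classifier says "this level or above",
    or 1 if every classifier says no.
    """
    for level, alpha, betas in classifiers:
        score = alpha + sum(b * xi for b, xi in zip(betas, x))
        if sigmoid(score) > 0.5:
            return level
    return 1


# Hypothetical parameters for k = 3 rating levels and M = 2 features.
classifiers = [
    (3, -2.0, [1.0, -1.5]),  # rating 3 versus lower
    (2, 0.5, [0.8, -1.0]),   # rating 2 or above versus rating 1
]
```

For example, a review with three positive words and no negative words
passes the first test and gets rating 3, while a review with two
negative words fails both tests and gets rating 1.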
Now unfortunately such a strategy is not
an optimal way of solving this problem.
And specifically there are two
problems with this approach.
So these equations are the same as
you have seen before.
Now the first problem is that there
are just too many parameters.
There are many parameters.
Now, can you count how many
parameters we have exactly here?
Now this may be an interesting exercise
to do.
So
you might want to just pause the video and
try to figure out the solution.
How many parameters do we have for
each classifier?
And how many classifiers do we have?
Well, you can see that
for each classifier we have
M plus one parameters, and we have k
minus one classifiers all together,
so the total number of parameters is
k minus one multiplied by M plus one.
That's a lot.
A lot of parameters, so when
the classifier has a lot of parameters,
we would in general need a lot of data,
training data,
to help us decide the optimal
parameters of such a complex model.
So that's not ideal.
Now the second problem
is that these problems,
these k minus 1 classifiers,
are not really independent.
These problems are actually dependent.
In general, words that are positive
would make the rating higher
for any of these classifiers,
for all these classifiers.
So we should be able to take
advantage of this fact.
Now the idea of ordinal logistic
regression is precisely that.
The key idea is just
the improvement over the k-1
independent logistic
regression classifiers.
And that idea is to tie
these beta parameters.
And that means we are going to
assume the beta parameters,
these are the parameters that indicate
the influence of those features,
and we're going to assume these
beta values are the same for
all the k-1 classifiers.
And this just encodes our intuition that
positive words in general would
make a higher rating more likely.
So this assumption is intuitively
reasonable for our problem setup.
And we have this order
in these categories.
Now in fact, this would allow us
to have two positive benefits.
One is it's going to reduce
the number of parameters significantly.
And the other is to allow us
to share the training data,
because all these parameters
are assumed to be equal.
So the training data for
different classifiers can then be
shared to help us set
the optimal value for beta.
So we have more data to help
us choose a good beta value.
So what's the consequence?
Well, the formula would look very similar
to what you have seen before, only that
now the beta parameter has just one
index that corresponds to the feature.
It no longer has the other index that
corresponds to the level of rating.
So that means we tie them together.
And there's only one set of beta
values for all the classifiers.
However, each classifier still
has a distinct alpha value.
The alpha parameter
is different.
And this is of course needed to predict
the different levels of ratings.
So alpha sub j is different; it
depends on j, and a different j
has a different alpha value.
But the rest of the parameters,
the beta i's, are the same.
So now you can also ask the question,
how many parameters do we have now?
Again, that's an interesting
question to think about.
So if you think about it for a moment,
you will see that now
we have far fewer parameters.
Specifically we have M plus k minus one,
because we have M beta values,
plus k minus one alpha values.
So
that's basically the main idea of
ordinal logistic regression.
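The two parameter counts can be compared with a quick calculation.
The vocabulary size and number of rating levels below are made-up
numbers chosen only to show the scale of the difference.

```python
def independent_param_count(num_features, num_levels):
    """k-1 separate classifiers, each with M betas plus one intercept."""
    return (num_levels - 1) * (num_features + 1)


def ordinal_param_count(num_features, num_levels):
    """One shared set of M betas plus k-1 alpha values."""
    return num_features + (num_levels - 1)


# Hypothetical setting: M = 10000 features, k = 5 rating levels.
separate = independent_param_count(10000, 5)  # (5 - 1) * (10000 + 1)
tied = ordinal_param_count(10000, 5)          # 10000 + (5 - 1)
```

In this setting the tied model has roughly a quarter of the
parameters of the k-1 independent classifiers.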
So, now, let's see how we can use such
a method to actually assign ratings.
It turns out that with this idea of
tying all the parameters, the beta values,
we also end up having
a similar way to make decisions.
And more specifically now, the criterion of
whether the predicted probability
is at least 0.5
is now equivalent to
whether the score of
the object is larger than or
equal to negative alpha sub j,
as shown here.
Now, the scoring function is just
taking the linear combination of
all the features with
the tied beta values.
So, this means now we can simply make
a decision of rating by looking at
the value of this scoring function,
and seeing which bracket it falls into.
Now you can see the general
decision rule is thus:
when the score is in a particular
range of alpha values,
then we will assign the corresponding
rating to that text object.
So in this approach,
we're going to score the object
by using the features and
trained parameter values.
This score will then be
compared with a set of trained
alpha values to see which
range the score is in.
And then,
using the range, we can then decide which
rating the object should be getting.
Because, these ranges of alpha
values correspond to the different
levels of ratings, and that's from
the way we train these alpha values.
Each is tied to some level of rating.
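The decision rule can be sketched as below. The parameter values
are hypothetical, and the sketch assumes the trained alpha values
decrease as the level j increases, so that the thresholds (negative
alpha sub j) increase with the rating level.

```python
def ordinal_predict(x, betas, alphas):
    """Assign a rating by finding which bracket the score falls into.

    betas:  one shared weight per feature.
    alphas: dict mapping level j (2..k) to alpha_j.
    The rating is the highest j with score >= -alpha_j, or 1 if none.
    """
    score = sum(b * xi for b, xi in zip(betas, x))
    rating = 1
    for level in sorted(alphas):
        if score >= -alphas[level]:
            rating = level
    return rating


# Hypothetical trained values for k = 4 levels and M = 2 features.
betas = [1.0, -1.0]                   # positive-word and negative-word counts
alphas = {2: 1.0, 3: -1.0, 4: -3.0}   # thresholds on the score: -1, 1, 3
```

With these values, a score of 2 falls between the thresholds 1 and 3,
so the object gets rating 3.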
[MUSIC]

[MUSIC]
This lecture is about the Latent Aspect
Rating Analysis for Opinion Mining and
Sentiment Analysis.
In this lecture,
we're going to continue discussing
Opinion Mining and Sentiment Analysis.
In particular, we're going to introduce
Latent Aspect Rating Analysis
which allows us to perform detailed
analysis of reviews with overall ratings.
So, first is motivation.
Here are two reviews that you often
see on the net about a hotel.
And you see some overall ratings.
In this case,
both reviewers have given five stars.
And, of course,
there are also reviews that are in text.
Now, if you just look at these reviews,
it's not very clear whether the hotel is
good for its location or for its service.
It's also unclear why
a reviewer liked this hotel.
What we want to do is to
decompose this overall rating into
ratings on different aspects such as
value, rooms, location, and service.
So, if we can decompose
the overall ratings into
ratings on these different aspects,
then we
can obtain a more detailed understanding
of the reviewer's opinions about the hotel.
And this would also allow us to rank
hotels along different dimensions
such as value or rooms.
But, in general, such detailed
understanding will reveal more information
about the user's preferences,
reviewer's preferences.
And also, we can understand better
how the reviewers view this
hotel from different perspectives.
Now, not only do we want to
infer these aspect ratings,
we also want to infer the aspect weights.
So, some reviewers may care more about
value as opposed to service.
And that would be a case
like what's shown on the left for
the weight distribution,
where you can see a lot of
weight is placed on value.
But others care more about service.
And therefore, they might place
more weight on service than value.
The reason why this is
also important is because,
if you think about a five star on value,
it might still be very expensive if the
reviewer cares a lot about service, right?
For this kind of service,
this price is good, so
the reviewer might give it a five star.
But if a reviewer really cares
about the value of the hotel,
then the five star, most likely,
would mean really cheap prices.
So, in order to interpret the ratings
on different aspects accurately,
we also need to know these aspect weights.
When they're combined together,
we can have a more detailed
understanding of the opinion.
So the task here is to get these reviews
and their overall ratings as input,
and then,
generate both the aspect ratings,
the decomposed aspect ratings, and
the aspect weights as output.
And this is a problem called
Latent Aspect Rating Analysis.
So the task, in general,
is given a set of review articles about
the topic with overall ratings, and
we hope to generate three things.
One is the major aspects
commented on in the reviews.
Second is ratings on each aspect,
such as value and room service.
And third is the relative weights placed
on different aspects by the reviewers.
And this task has a lot of applications,
and if you can do this,
it will enable a lot of applications.
I just listed some here.
And later, I will show you some results.
And, for example,
we can do opinion based entity ranking.
We can generate an aspect-level
opinion summary.
We can also analyze reviewers' preferences,
compare them or
compare their preferences
on different hotels.
And we can do personalized
recommendations of products.
So, of course, the question is
how can we solve this problem?
Now, as in other cases of
these advanced topics,
we won’t have time to really
cover the technique in detail.
But I'm going to give you a brief,
basic introduction to the techniques
developed for this problem.
So, first step, we’re going to talk about
how to solve the problem in two stages.
Later, we’re going to also mention that
we can do this in a unified model.
Now, take this review with
the overall rating as input.
What we want to do is, first,
we're going to segment the aspects.
So we're going to pick out what words
are talking about location, and
what words are talking
about room condition, etc.
So with this, we would be able
to obtain aspect segments.
In particular, we're going to
obtain the counts of all the words
in each segment, and
this is denoted by C sub I of W and D.
Now this can be done by using seed
words like location and room or
price to retrieve
the [INAUDIBLE] in the segments.
And then, from those segments,
we can further mine correlated
words with these seed words, and
that would allow us to segment
the text into segments
discussing different aspects.
But, of course,
later, as we will see, we can also use
[INAUDIBLE] models to do the segmentation.
But anyway, that's the first stage,
where we obtain the counts
of words in each segment.
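The seed-word step can be sketched as below. This is a simplified
illustration only: it assigns each sentence to the aspect with the
most seed-word matches and does not include the further mining of
correlated words. The seed lists, the sentence splitting on periods,
and the example review are all assumptions.

```python
from collections import Counter


def segment_by_seed_words(review, seed_words):
    """Return per-aspect word counts, playing the role of c_i(w, d)."""
    counts = {aspect: Counter() for aspect in seed_words}
    for sentence in review.lower().split("."):
        tokens = sentence.split()
        if not tokens:
            continue
        overlap = {
            aspect: sum(token in seeds for token in tokens)
            for aspect, seeds in seed_words.items()
        }
        best = max(overlap, key=overlap.get)
        if overlap[best] > 0:  # skip sentences that match no aspect
            counts[best].update(tokens)
    return counts


# Hypothetical seed words for two aspects.
seed_words = {
    "location": {"location", "walk", "far"},
    "room": {"room", "bed", "clean"},
}
review = (
    "The location is amazing. "
    "The room was clean and the bed was comfortable. "
    "We loved it."
)
counts = segment_by_seed_words(review, seed_words)
```

Here "amazing" is counted in the location segment, "bed" in the room
segment, and the last sentence is left unassigned.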
In the second stage,
which is called Latent Rating Regression,
we're going to use these words and
their frequencies in different
aspects to predict the overall rating.
And this predicting happens in two stages.
In the first stage,
we're going to use the [INAUDIBLE] and
the weights of these words in each
aspect to predict the aspect rating.
So, for example, if in your discussion
of location, you see a word like,
amazing, mentioned many times,
and it has a high weight.
For example, here, 3.9.
Then, it will increase
the Aspect Rating for location.
But another word like far,
which has a negative weight,
if it's mentioned many times,
will decrease the rating.
So the aspect rating is assumed to
be a weighted combination of these
word frequencies where the weights
are the sentiment weights of the words.
Of course, these sentiment weights
might be different for different aspects.
So we have, for each aspect, a set of
term sentiment weights as shown here.
And that's denoted by beta sub i of w.
In the second stage or second step,
we're going to assume that the overall
rating is simply a weighted
combination of these aspect ratings.
So we're going to assume we have aspect
weights denoted by alpha sub i of d,
and these will be used to take a weighted
average of the aspect ratings,
which are denoted by r sub i of d.
And we're going to assume the overall
rating is simply a weighted
average of these aspect ratings.
So this set up allows us to predict
the overall rating based on
the observable frequencies.
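The two steps of this prediction can be sketched as follows. The
weight 3.9 for "amazing" is the example value mentioned above; every
other number is hypothetical, and the aspect weights are assumed to
sum to one so that the combination is a weighted average.

```python
def aspect_rating(word_counts, sentiment_weights):
    """r_i(d): sum over words of count times aspect-specific weight."""
    return sum(
        count * sentiment_weights.get(word, 0.0)
        for word, count in word_counts.items()
    )


def overall_rating(aspect_ratings, aspect_weights):
    """r_d: weighted average of aspect ratings using alpha_i(d)."""
    return sum(aspect_weights[a] * aspect_ratings[a] for a in aspect_ratings)


# Word counts per aspect segment, c_i(w, d) (made-up values).
counts = {"location": {"amazing": 1, "far": 1}, "room": {"clean": 2}}
# Aspect-specific sentiment weights, beta_i(w) (mostly hypothetical).
betas = {"location": {"amazing": 3.9, "far": -1.0}, "room": {"clean": 2.0}}
# Aspect weights for this review, alpha_i(d) (hypothetical).
alphas = {"location": 0.5, "room": 0.5}

ratings = {a: aspect_rating(counts[a], betas[a]) for a in counts}
predicted = overall_rating(ratings, alphas)
```

In the actual model the betas and alphas are not given; they are the
latent quantities to be inferred from the observed overall ratings.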
So on the left side,
you will see all this observed
information, the r sub d and the counts.
But on the right side,
you see all the information in
that range is actually latent.
So, we hope to discover that.
Now, this is a typical case of
a generative model where we embed
the interesting variables
in the generative model.
And then, we're going to set up
a generation probability for
the overall rating given
the observed words.
And then, of course, we can adjust these
parameter values, including the betas and
the alpha i's, in order to maximize
the probability of the data.
In this case, the conditional probability
of the observed rating given the document.
So we have seen such cases before in, for
example, PLSA,
where we predict text data.
But here, we're predicting the rating,
and the parameters,
of course, are very different.
But we can see, if we can uncover
these parameters, it would be nice,
because r sub i of d is precisely
the ratings that we want to get.
And these are the decomposed
ratings on different aspects.
Alpha sub i of d is precisely
the aspect weights that we
hope to get. As a byproduct,
we also get the beta vector, and
these are the [INAUDIBLE] factor,
the sentiment weights of words.
So more formally,
the data we are modeling here is a set of
review documents with overall ratings.
And each review document is denoted by d,
and the overall rating is denoted by r sub d.
And d is pre-segmented
into k aspect segments.
And we're going to use ci(w,d) to denote
the count of word w in aspect segment i.
Of course, it's zero if the word
doesn't occur in the segment.
Now, the model is going to
predict the rating based on d.
So, we're interested in the conditional
probability of r sub d given d.
And this model is set up as follows.
So r sub d is assumed to
follow a normal distribution
with a mean that is
actually a weighted average
of the aspect ratings r
sub i of d, as shown here.
This normal distribution has
a variance of delta squared.
Now, of course,
this is just our assumption.
The actual rating is not necessarily
generated this way.
But as always, when we make this
assumption, we have a formal way to
model the problem and that allows us
to compute the interesting quantities.
In this case, the aspect ratings and
the aspect weights.
Now, the aspect rating, as
you see on the [INAUDIBLE],
is assumed to be
a weighted sum of word counts,
where the weight is just
the sentiment weight of the word.
So as I said,
the overall rating is assumed to be
a weighted average of aspect ratings.
Now, these alpha values, alpha
sub i of d, are denoted together
by an alpha vector that depends on d.
These are the document-specific weights.
And we're going to assume that
this vector itself is drawn
from another multivariate Gaussian
distribution,
with mean denoted by a mu vector
and covariance matrix sigma here.
Now, this means, when we generate our
overall rating, we're going to first draw
a set of alpha values from this
multivariate Gaussian prior distribution.
And once we get these alpha values,
we're going to use the weighted
average of aspect ratings as
the mean of the normal
distribution to generate
the overall rating.
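Written out, the assumptions described so far can be summarized roughly as follows. This is only a compact sketch in the lecture's notation. The symbols V for the vocabulary, beta sub i, w for the sentiment weight of word w in aspect i, and the bold alpha of d for the weight vector are shorthand chosen here, and the exact form on the slide may differ.

```latex
\begin{aligned}
r_i(d) &= \sum_{w \in V} c_i(w,d)\,\beta_{i,w} \\
\boldsymbol{\alpha}(d) &\sim \mathcal{N}(\boldsymbol{\mu}, \Sigma) \\
r_d \mid d &\sim \mathcal{N}\Big(\sum_{i=1}^{k} \alpha_i(d)\, r_i(d),\ \delta^2\Big)
\end{aligned}
```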
Now, the aspect rating, as I just said,
is the sum of the sentiment weights of
words in the aspect. Note that here the
sentiment weights are specific to aspect.
So, beta is indexed by i,
and that's for the aspect.
And that gives us a way to model
different sentiment of a word.
This is needed because
the same word might have
positive sentiment for one aspect
and negative sentiment for another.
It's also useful to see what parameters
we have here. Beta sub i of
w gives us the aspect-specific
sentiment of w.
So, obviously,
that's one of the important parameters.
But, in general, we can see we have these
parameters, beta values, the delta,
and the Mu, and sigma.
So, next, the question is, how can
we estimate these parameters? So
we collectively denote all
the parameters by lambda here.
Now, we can, as usual,
use the maximum likelihood estimate, and
this will give us the settings
of these parameters
that would maximize the probability
of the observed ratings
conditioned on their respective reviews.
And, of course,
this would then give us all the useful
variables that we
are interested in computing.
So, more specifically,
once we estimate the parameters,
we can easily compute the aspect rating
for aspect i, r sub i of d.
And that's simply to take all of the words
that occurred in segment i,
and then take their counts and
then multiply that by the sentiment
weight of each word and take a sum.
So, of course, this term would be zero for
words that do not occur in the segment, and
that's why we can take the sum
over all the words in the vocabulary.
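The computation just described can be sketched in a few lines of Python. This is only an illustration: the function name, the example words, and the weight values are made up here, and treating a word with no learned weight as having weight zero is an assumption of the sketch.

```python
def aspect_rating(segment_counts, sentiment_weights):
    """Return r_i(d): the sum over words of count times sentiment weight.

    Words that do not occur in the segment contribute nothing, so summing
    over the segment's words equals summing over the whole vocabulary.
    """
    return sum(count * sentiment_weights.get(word, 0.0)
               for word, count in segment_counts.items())


# Hypothetical counts c_i(w, d) for the "location" segment of one review.
location_counts = {"close": 2, "far": 1, "downtown": 1}
# Hypothetical aspect-specific weights beta_{i,w} (made up, not learned).
location_beta = {"close": 0.8, "far": -1.0, "downtown": 0.5}

rating = aspect_rating(location_counts, location_beta)
```

Because the weights are indexed by aspect, the same word could carry a different value in a dictionary for another aspect.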
Now what about the aspect weights,
alpha sub i of d? Well,
it's not part of our parameters,
right?
So we have to infer it.
And in this case, we can use Maximum
a Posteriori estimation to compute this alpha value.
Basically, we're going to maximize the
product of the prior of alpha according
to our assumed multivariate Gaussian
distribution and the likelihood.
In this case,
the likelihood is the probability of
generating this observed overall rating
given this particular alpha value and
some other parameters, as you see here.
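To make the idea of maximizing prior times likelihood concrete, here is a minimal Python sketch under simplifying assumptions that are not from the lecture: the prior covariance is taken to be diagonal, and no constraint (such as the weights summing to one) is imposed on alpha. Under those assumptions the log of the product is quadratic in alpha, so the maximizer has a closed form. The function name and all numbers are hypothetical.

```python
def map_aspect_weights(overall_rating, aspect_ratings, prior_mean, prior_var, delta_sq):
    """MAP estimate of the aspect weights alpha for one review.

    Maximizes log N(r_d | sum_i alpha_i * r_i(d), delta_sq)
            + log N(alpha | prior_mean, diag(prior_var)).
    Assumes a diagonal prior covariance and unconstrained alpha.
    """
    # Rating predicted by the prior mean weights alone.
    predicted = sum(m * r for m, r in zip(prior_mean, aspect_ratings))
    # How much the prediction can move, given the prior variances.
    spread = sum(v * r * r for v, r in zip(prior_var, aspect_ratings))
    gain = (overall_rating - predicted) / (delta_sq + spread)
    return [m + v * r * gain
            for m, v, r in zip(prior_mean, prior_var, aspect_ratings)]


# Made-up numbers for one review with three aspects.
aspect_ratings = [5.0, 3.0, 2.0]
prior_mean = [1 / 3, 1 / 3, 1 / 3]
prior_var = [0.05, 0.05, 0.05]
alpha = map_aspect_weights(4.0, aspect_ratings, prior_mean, prior_var, delta_sq=0.25)
```

In this toy case the observed overall rating is higher than what equal weights would predict, so the estimate shifts weight toward the aspects with higher ratings. The paper cited in the lecture should be consulted for the actual inference procedure.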
So for more details about this model,
you can read this paper cited here.
[MUSIC]

[SOUND] This lecture is a continued
discussion of
Latent Aspect Rating Analysis.
Earlier, we talked about how to solve
the problem of LARA in two stages.
We first do segmentation
of different aspects.
And then we use a latent regression
model to learn the aspect ratings and
the latent weights.
Now it's also possible to develop
a unified generative model for
solving this problem. That is,
we not only model the generation of
the overall rating based on text,
we also model the generation of text.
And so
a natural solution would
be to use a topic model.
So given the entity,
we can assume there are aspects that
are described by word distributions,
that is, topics.
And then we can use a topic model to model
the generation of the review text.
We will assume words in the review text
are drawn from these distributions,
in the same way as we assumed for
a generative model like PLSA.
And then we can plug in
the latent regression model to
use the text to further
predict the overall rating.
And that means we first
predict the aspect ratings and
then combine them with aspect weights
to predict the overall rating.
So this would give us
a unified generative model,
where we model both the generation of text
and the overall rating conditioned on text.
So we don't have time to discuss
this model in detail, as in
many other cases in this part of the course
where we discuss cutting-edge topics,
but there's a reference cited here
where you can find more details.
So now I'm going to show you some
sample results that you can get
by using this kind of generative model.
First, it's about rating decomposition.
So here, what you see
are the decomposed ratings for
three hotels that have
the same overall rating.
So if you just look at the overall rating,
you can't really tell much
difference between these hotels.
But by decomposing these
ratings into aspect ratings
we can see some hotels have higher
ratings for some dimensions,
like value, but others might score better
in other dimensions, like location.
And so this can give you detailed
opinions at the aspect level.
Now here, the ground truth is
shown in parentheses, so
it also allows you to see whether
the prediction is accurate.
It's not always accurate, but it's mostly
still reflecting some of the trends.
The second result compares
different reviewers on the same hotel.
So the table shows the decomposed ratings
for two reviewers of the same hotel.
Again, their high-level
overall ratings are the same.
So if you just look at the overall
ratings, you don't really get that much
information about the difference
between the two reviewers.
But after you decompose the ratings,
you can see clearly that they have
high scores on different dimensions.
So this shows that the model can reveal
differences in opinions of different
reviewers, and such a detailed
understanding can help us understand
the reviewers better and also better
understand their feedback on the hotel.
This is something very interesting,
because this is in some
sense a byproduct.
In our problem formulation,
we did not really have to do this.
But the design of the generative
model has this component.
And these are sentiment weights for
words in different aspects.
And you can see the highly weighted words
versus the negatively weighted
words here for
each of the four dimensions:
value, rooms, location, and cleanliness.
The top words clearly make sense, and
the bottom words also make sense.
So this shows that with this approach,
we can also learn sentiment
information directly from the data.
Now, this kind of lexicon is very useful
because in general, a word like long,
let's say, may have different sentiment
polarities in different contexts.
So if I say the battery life of this
laptop is long, then that's positive.
But if I say the rebooting time for
the laptop is long, that's bad, right?
So even for
reviews about the same product, a laptop,
the word long is ambiguous. It could
mean positive or it could mean negative.
But this kind of lexicon, which we can
learn by using this kind of generative
model, can show whether a word is
positive for a particular aspect.
So this is clearly very useful, and in
fact such a lexicon can be directly used
to tag other reviews about hotels or
tag comments about hotels in
social media like tweets.
And what's also interesting is that since
this is almost completely unsupervised,
assuming only that reviews with
overall ratings are available,
this can allow us to learn from
a potentially large amount of data on
the internet to enrich the sentiment lexicon.
And here are some results to
validate the preference weights.
Remember the model can infer whether
a reviewer cares more about service or
the price.
Now how do we know whether
the inferred weights are correct?
This poses a very difficult
challenge for evaluation.
Now here we show an
interesting way of evaluating.
What you see here are the prices
of hotels in different cities, and
these are the prices of hotels that are
favored by different groups of reviewers.
The top ten are the reviewers
with the highest
inferred value-to-other-aspect ratios,
for example value versus location,
value versus room, etcetera.
So the top ten are the reviewers that
have the highest ratios by this measure.
And that means these reviewers
tend to put a lot of
weight on value as compared
with other dimensions.
So that means they really
emphasize value.
The bottom ten, on the other
hand, are the reviewers with
the lowest ratios. What does that mean?
Well, it means these reviewers have
put higher weights on other aspects
than value.
So those are people who cared about
another dimension and didn't care so
much about value in some sense, at least
as compared with the top ten group.
Now these ratios are computed based on
the inferred weights from the model.
So now you can see the average prices
of hotels favored by top ten reviewers
are indeed much cheaper than those
that are favored by the bottom ten.
And this provides some indirect way
of validating the inferred weights.
It just means the weights are not random.
They are actually meaningful here.
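The comparison just described could be sketched as follows. This is a hedged illustration, not the evaluation code behind the slide: the field names, the choice to divide the value weight by the average of the other weights, and all numbers are assumptions made up for the example.

```python
from statistics import mean


def value_ratio(weights):
    """Ratio of the inferred value weight to the average of the other weights."""
    others = [w for aspect, w in weights.items() if aspect != "value"]
    return weights["value"] / mean(others)


def average_price_by_group(reviewers, n):
    """Average price of favored hotels for the n highest- and n lowest-ratio reviewers."""
    ranked = sorted(reviewers, key=lambda r: value_ratio(r["weights"]), reverse=True)
    top, bottom = ranked[:n], ranked[-n:]
    return mean(r["price"] for r in top), mean(r["price"] for r in bottom)


# Made-up reviewers: inferred aspect weights and the price of a hotel they favored.
reviewers = [
    {"weights": {"value": 0.6, "location": 0.2, "room": 0.2}, "price": 90},
    {"weights": {"value": 0.5, "location": 0.3, "room": 0.2}, "price": 110},
    {"weights": {"value": 0.2, "location": 0.4, "room": 0.4}, "price": 240},
    {"weights": {"value": 0.1, "location": 0.5, "room": 0.4}, "price": 300},
]
top_avg, bottom_avg = average_price_by_group(reviewers, n=2)
```

If the inferred weights were random, there would be no reason to expect the two averages to differ in a consistent direction, which is what makes this an indirect check.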
In comparison with
the average price in these three cities,
you can actually see the top ten
tend to favor hotels priced below average,
whereas the bottom ten, who care
a lot about other things like service or
room condition, tend to favor hotels
that have higher prices than average.
So with these results we can build
a lot of interesting applications.
For example, a direct application would be
to generate a rated aspect summary.
Because of the decomposition, we
can now generate summaries for
each aspect:
the positive sentences and the negative
sentences about each aspect.
It's more informative than the original review
that just has an overall rating and
review text.
Here are some other results
about the aspects discovered
from reviews with no ratings.
These are mp3 reviews,
and these results show that the model
can discover some interesting aspects
commented on in reviews with low overall
ratings versus those with higher overall ratings.
And the reviewers care more about
different aspects,
or they comment more on
different aspects.
So that can help us discover for
example, consumers'
trend in appreciating different
features of products.
For example, one might have discovered
the trend that people tend to
like larger screens of cell phones or
light weight of laptop, etcetera.
Such knowledge can be useful for
manufacturers to design their
next generation of products.
Here are some interesting results
on analyzing users' rating behavior.
So what you see is average weights
along different dimensions by
different groups of reviewers.
And on the left side you see the weights
of reviewers that like the expensive hotels.
They gave the expensive hotels 5 stars,
and
you can see their average weights
tend to be higher on service.
And that suggests that people like
expensive hotels because of good service,
and that's not surprising.
That's also another way to
validate the inferred weights.
If you look at the right side,
look at the column of 5 stars.
These are the reviewers that
like the cheaper hotels, and
they gave cheaper hotels five stars.
As we expected,
they put more weight on value,
and that's why they like
the cheaper hotels.
But if you look at the cases when they didn't
like expensive hotels or cheaper hotels,
then you'll see that they tended to
put more weight on the condition of
the room and cleanliness.
So this shows that by using this model,
we can infer some
information that's very hard to obtain
even if you read all the reviews.
Even if you read all the reviews it's
very hard to infer such preferences or
such emphasis.
So this is a case where text mining
algorithms can go beyond what
humans can do, to reveal
interesting patterns in the data.
And this of course can be very useful.
You can compare different hotels,
compare the opinions from different
consumer groups, in different locations.
And of course, the model is general.
It can be applied to any
reviews with overall ratings.
So this is a very useful
technique that can
support a lot of text mining applications.
Finally, here are the results of applying this
model to personalized ranking or
recommendation of entities.
So because we can infer the reviewers'
weights on different dimensions,
we can allow a user to actually
say what they care about.
So for example, I have a query
here that shows 90% of the weight
should be on value and 10% on others.
So that just means I don't
care much about other aspects.
I just care about getting a cheaper hotel.
My emphasis is on the value dimension.
Now what we can do with such a query
is we can use reviewers that we
believe have a similar preference
to recommend hotels for you.
How can we know that?
Well, we can infer the weights of
those reviewers on different aspects.
We can find the reviewers whose
weights, or more precisely
whose inferred weights,
are similar to yours,
and then use those reviewers to
recommend hotels for you.
This is what we call personalized or
rather query-specific recommendation.
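One simple way to picture this query-specific recommendation is sketched below. The lecture does not say how similarity between weights is measured or how the selected reviewers' opinions are combined, so cosine similarity and a plain average of overall ratings are assumptions of this sketch, and the hotel names and numbers are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(query_weights, reviewers, n_reviewers=2):
    """Rank hotels using only the reviewers whose inferred weights best match the query."""
    nearest = sorted(reviewers,
                     key=lambda r: cosine(query_weights, r["weights"]),
                     reverse=True)[:n_reviewers]
    ratings_by_hotel = {}
    for reviewer in nearest:
        for hotel, rating in reviewer["ratings"].items():
            ratings_by_hotel.setdefault(hotel, []).append(rating)
    return sorted(ratings_by_hotel,
                  key=lambda h: sum(ratings_by_hotel[h]) / len(ratings_by_hotel[h]),
                  reverse=True)


# Weight order: (value, location, room). All data below is made up.
reviewers = [
    {"weights": [0.8, 0.1, 0.1], "ratings": {"Budget Inn": 5, "Grand Plaza": 2}},
    {"weights": [0.7, 0.2, 0.1], "ratings": {"Budget Inn": 4, "City Lodge": 5}},
    {"weights": [0.1, 0.5, 0.4], "ratings": {"Grand Plaza": 5, "Budget Inn": 1}},
]
query = [0.9, 0.05, 0.05]  # 90% of the weight on value
ranking = recommend(query, reviewers)
```

Here the third reviewer, who emphasizes location and room, is left out, so that reviewer's high rating for the hypothetical Grand Plaza does not influence the ranking.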
Now the non-personalized
recommendations are shown on the top,
and you can see the top results generally
have much higher prices than the lower
group, and that's because when the
reviewers cared more about value, as
dictated by this query, they tended
to really favor low-priced hotels.
So this is yet
another application of this technique.
It shows that by doing text mining
we can understand the users better.
And once we can understand users better,
we can serve these users better.
So to summarize our discussion
of opinion mining in general,
this is a very important topic
with a lot of applications.
Text sentiment
analysis can be readily done by
using just text categorization.
But standard techniques
tend not to be enough.
And so we need to have an enriched
feature representation.
And we also need to consider
the order of those categories.
And we talked about ordinal
regression for some of these problems.
We have also shown that
generative models are powerful for
mining latent user preferences,
in particular the generative
model for latent rating regression.
We embed some interesting
preference information and
sentiment weights of words in the model,
and as a result we can learn such
useful information when
fitting the model to the data.
Now most approaches have been proposed and
evaluated
for product reviews, and that was because
in such a context, the opinion holder and
the opinion target are clear,
and they are easy to analyze.
And they, of course,
also have a lot of practical applications.
Opinion mining from news and
social media is also important, but that's
more difficult than analyzing review data,
mainly because the opinion holders and
opinion targets are often not stated explicitly.
So that calls for
natural language processing
techniques to uncover them accurately.
Here are some suggested readings.
The first two are small books that
give surveys of this topic,
where you can find a lot of discussion
about other variations of the problem and
techniques proposed for
solving the problem.
The next two papers are about
generative models for
latent aspect rating analysis.
The first one is about solving
the problem using two stages, and
the second one is about a unified model
where the topic model is integrated
with the regression model to solve
the problem.
[MUSIC]

[SOUND] This lecture is about
text-based prediction.
In this lecture, we're going to
start talking about mining
a different kind of knowledge,
as you can see here on this slide.
Namely, we're going to use text
data to infer values of some other
variables in the real world that may
not be directly related to the text,
or only remotely related to text data.
So this is very different
from content analysis or
topic mining, where we directly
characterize the content of text.
It's also different from opinion mining or
sentiment analysis,
which still has to do with
characterizing mostly the content,
only that we focus more
on the subjective content,
which reflects what we know
about the opinion holder.
But this only provides a limited
view of what we can predict.
In this lecture and the following
lectures, we're going to talk more about
how we can predict more
information about the world.
How can we get sophisticated patterns
of text together with other kinds of data?
It would be useful first to take a look
at the big picture of prediction and
data mining in general, and
I call this the data mining loop.
So the picture that you are seeing right
now is that there are multiple sensors,
including human sensors,
to report what we have seen in
the real world in the form of data.
Of course, the data are in the form
of both non-text data and text data.
And our goal is to see if we
can predict some values of
important real-world
variables that matter to us.
For example, someone's health condition,
or the weather, etc.
And so these variables would be important
because we might want to act on that.
We might want to make
decisions based on that.
So how can we get from the data
to these predicted values?
Well, in general we'll first have to do
data mining and analysis of the data.
Because we, in general, should use
all the data that we collected
in such a prediction problem setup,
we are very much interested in
joint mining of non-text and
text data, which should
combine all the data together.
And then, through analysis,
we generally obtain
multiple predictors of the
variable that interests us.
And we call these features.
And these features can then be
put into a predictive model
to actually predict the value
of any interesting variable.
So this then allows us
to change the world.
And so
this basically is the general process for
making a prediction based on data,
including text data.
Now it's important to emphasize
that humans actually
play a very important
role in this process,
especially because of
the involvement of text data.
So humans would first be involved
in the mining of the data.
They would control the generation
of these features.
And they would also help us
understand the text data,
because text data are created
to be consumed by humans.
Humans are the best at consuming or
interpreting text data.
But when there is, of course, a lot of
text data, then machines have to help, and
that's why we need to do text data mining.
Sometimes machines can see patterns in
a lot of data that humans may not see.
But in general humans would
play an important role in
analyzing text data in applications.
Next, humans also must be involved
in predictive model building and
adjusting or testing.
So in particular, we will have a lot
of domain knowledge about the problem
of prediction that we can build
into this predictive model.
And then next, of course, when we have
predicted values for the variables,
then humans would be involved in
taking actions to change the world or
make decisions based on
these predicted values.
And finally it's interesting
that a human could be involved
in controlling the sensors.
And this is so that we can
adjust the sensors to collect
the most useful data for prediction.
So that's why I call
this the data mining loop.
Because as we perturb the sensors,
they'll collect new data and
more useful data, and then we will
obtain more data for prediction.
And this data generally will help
us improve the prediction accuracy.
And in this loop,
humans will recognize what additional
data will need to be collected.
And machines, of course,
help humans identify what data
should be collected next.
In general, we want to collect data
that is most useful for learning.
And there is actually a subarea in
machine learning called active learning
that has to do with this:
how do you identify data
points that would be most helpful
to machine learning programs
if you can label them?
So, in general,
you can see there is a loop here from
data acquisition to data analysis,
or data mining, to prediction of values,
and to taking actions to change the world,
and then observing what happens.
And then you can
decide what additional data
have to be collected by
adjusting the sensors.
Or from the prediction errors,
you can also note what additional data
we need to acquire in order to
improve the accuracy of prediction.
And this big picture is
actually very general and
it's reflecting a lot of important
applications of big data.
So, it's useful to keep that in mind
while we are looking at some text
mining techniques.
So from a text mining perspective,
we're interested in text-based prediction.
Of course, sometimes text
alone can make predictions.
And this is most useful for
prediction about human behavior or
human preferences or opinions.
But in general text data will be
put together with non-text data.
So the interesting questions
here would be, first,
how can we design effective predictors?
And how do we generate such
effective predictors from text?
And this question has been addressed to
some extent in some previous lectures
where we talked about what kind of
features we can design for text data.
And it has also been
addressed to some extent by
talking about the other knowledge
that we can mine from text.
So, for example, topic mining can be very
useful to generate patterns or
topic-based indicators or predictors that can
be further fed into a predictive model.
So topics can be an intermediate
representation of text
that would allow us to
design high-level features or
predictors that are useful for
prediction of some other variable.
Although they are also generated from the
original text data, they provide a much better
representation of the problem and
serve as more effective predictors.
And similarly, sentiment analysis can
lead to such predictors as well.
So, those other data mining or
text mining algorithms can be
used to generate predictors.
The other question is, how can we jointly
mine text and non-text data together?
Now, this is a question that
we have not addressed yet.
So, in this lecture,
and in the following lectures,
we're going to address this problem,
because this is where we can generate much
more enriched features for prediction.
It also allows us to reveal a lot of
interesting knowledge about the world.
The patterns that
are generated from text and
non-text data themselves can sometimes
already be useful for prediction.
But when they are put together
with many other predictors,
they can really help
improve the prediction.
Basically, you can see text-based
prediction can actually serve as a unified
framework to combine many text mining and
analysis techniques,
including topic mining and any content
mining techniques or sentiment analysis.
The goal here is mainly to infer
values of real-world variables.
But in order to achieve the goal
we can do some other preparations,
and these are subtasks.
So one subtask could be to mine the content
of text data, like topic mining.
And the other could be to mine
knowledge about the observer,
so sentiment analysis or opinion mining.
And both can help provide predictors for
the prediction problem.
And of course we can also add non-text
data directly to the predictive model, but
non-text data also helps
provide a context for text analysis.
And that further improves the topic
mining and the opinion analysis.
And such improvement often leads to more
effective predictors for our problems.
It would enlarge the space of patterns
of opinions and topics that we can
mine from text,
and that we'll discuss more later.
So the joint analysis of text and
non-text data can actually be
understood from two perspectives.
From one perspective,
non-text data can help with text mining,
because non-text data can
provide a context for
mining text data and provide a way to
partition data in different ways.
And this leads to a number of
techniques for contextual text mining,
that is, mining text in
the context defined by non-text data.
And you see this reference here for
a large body of work in this direction.
And I will highlight some of them
in the next lectures.
Now, the other perspective is text data
can help with non-text
data mining as well.
And this is because text
data can help interpret
patterns discovered from non-text data.
Let's say you discover some frequent
patterns from non-text data.
Now we can use the text data
associated with instances
where the pattern occurs, as well as
text data that is associated with
instances where the pattern
doesn't occur.
And this gives us two sets of text data.
And then we can see what's the difference.
And this difference in text data is
interpretable because text content is
easy to digest.
And that difference might
suggest some meaning for
this pattern that we
found from non-text data.
So, it helps interpret such patterns.
And this technique is
called pattern annotation.
And you can see this reference
listed here for more detail.
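As a rough picture of that comparison, here is a small Python sketch. It is not the method from the cited reference. It simply assumes whitespace tokenization and scores each word by the difference in its relative frequency between the two sets of text, and the example texts are made up.

```python
from collections import Counter


def contrast_terms(texts_with, texts_without, top_n=3):
    """Words relatively more frequent in text where the pattern occurs."""
    with_counts = Counter(w for t in texts_with for w in t.lower().split())
    without_counts = Counter(w for t in texts_without for w in t.lower().split())
    total_with = max(sum(with_counts.values()), 1)
    total_without = max(sum(without_counts.values()), 1)

    def score(word):
        return with_counts[word] / total_with - without_counts[word] / total_without

    # Sort by score, breaking ties alphabetically so the result is deterministic.
    ranked = sorted(with_counts, key=lambda w: (-score(w), w))
    return ranked[:top_n]


# Made-up text attached to instances with and without some non-text pattern.
texts_with = ["storm damage flood", "flood warning storm"]
texts_without = ["sunny market day", "market prices rise"]
labels = contrast_terms(texts_with, texts_without, top_n=2)
```

The returned words are a crude, readable hint at what the pattern might mean, which is the intuition behind using text to interpret patterns found in non-text data.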
So here are the references
that I just mentioned.
The first is the reference for
pattern annotation.
The second is Qiaozhu Mei's
dissertation on contextual text mining.
It contains a large body of work on
contextual text mining techniques.
[MUSIC]

[SOUND] This lecture is about
contextual text mining.
Contextual text mining
is related to multiple
kinds of knowledge that we mine from
text data, as I'm showing here.
It's related to topic mining because you
can make topics associated with context,
like time or location.
And similarly, we can make opinion
mining more contextualized,
making opinions connected to context.
It's related to text based prediction
because it allows us to combine non-text
data with text data to derive
sophisticated predictors for
the prediction problem.
So more specifically, why are we
interested in contextual text mining?
Well, that's first because text
often has rich context information.
And this can include direct context such
as meta-data, and also indirect context.
So, the direct context can include
meta-data such as time,
location, authors, and
source of the text data.
And they're almost always available to us.
Indirect context refers to additional
data related to the meta-data.
So for example, from authors,
we can further obtain additional
context such as the social network of
the author, or the author's age.
Such information is not in general
directly related to the text, yet
through the author, we can connect them.
There could be other text
data from the same source
as this one, and through the source the
other text can be connected with this text as well.
So in general, any related data
can be regarded as context.
So there could be remotely
related context.
And so what's the use?
What is text context used for?
Well, context can be used to partition
text data in many interesting ways.
It can allow us to partition
text data in almost any way we need.
And this is very important
because this allows
us to do interesting comparative analyses.
It also in general,
provides meaning to the discovered topics,
if we associate the text with context.
So here's an illustration of how context
can be regarded as interesting
ways of partitioning text data.
So here I just showed some research
papers published in different years
and in different venues, with
different conference names listed
at the bottom, like SIGIR or ACL, etc.
Now such text data can be partitioned
in many interesting ways
because we have context.
So the context here just includes time and
the conference venues.
But perhaps we can include
some other variables as well.
But let's see how we can partition
this data in interesting ways.
First, we can treat each
paper as a separate unit.
So in this case, the paper ID is the
context, and each paper has its own context.
It's independent.
But we can also treat all the papers
published in 1998 as one group, and
this is only possible because
of the availability of time.
And we can partition data in this way.
This would allow us to compare topics,
for example, in different years.
Similarly, we can partition
the data based on the venues.
We can get all the SIGIR papers and
compare those papers with the rest.
Or compare SIGIR papers with KDD papers,
with ACL papers.
We can also partition the data to obtain
the papers written by authors in the U.S.,
and that of course
uses additional context of the authors.
And this would allow us to then
compare such a subset with
another set of papers written
by authors in other countries.
Or we can obtain a set of
papers about text mining, and
this can be compared with
papers about another topic.
And note that these
partitionings can also be
intersected with each other to generate
even more complicated partitions.
And so in general, this enables
discovery of knowledge associated with
different contexts as needed.
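The partitioning idea can be sketched very simply in Python. The paper records, field names, and helper function below are hypothetical and only meant to show that each context variable, or a combination of them, gives a different way to group the same documents.

```python
from collections import defaultdict


def partition(papers, key):
    """Group paper IDs by whatever context the key function extracts."""
    groups = defaultdict(list)
    for paper in papers:
        groups[key(paper)].append(paper["id"])
    return dict(groups)


# Made-up paper records with two context variables: year and venue.
papers = [
    {"id": "p1", "year": 1998, "venue": "SIGIR"},
    {"id": "p2", "year": 1998, "venue": "KDD"},
    {"id": "p3", "year": 1999, "venue": "SIGIR"},
    {"id": "p4", "year": 1999, "venue": "ACL"},
]

by_year = partition(papers, lambda p: p["year"])
by_venue = partition(papers, lambda p: p["venue"])
# Intersecting two partitionings gives a finer one.
by_year_and_venue = partition(papers, lambda p: (p["year"], p["venue"]))
```

Once the documents are grouped this way, topics mined from each group can be compared, for example across years to see trends.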
And in particular,
we can compare different contexts.
And this often gives us
a lot of useful knowledge.
For example, comparing topics over time,
we can see trends of topics.
Comparing topics in different
contexts can also reveal differences
about the two contexts.
So there are many interesting questions
that require contextual text mining.
Here I list some very specific ones.
For example, what topics have
been getting increasing attention
recently in data mining research?
Now to answer this question,
obviously we need to analyze
text in the context of time.
So time is context in this case.
Is there any difference in the responses
of people in different regions
to an event, to any event?
So this is a very broad
question.
In this case of course,
location is the context.
What are the common research
interests of two researchers?
In this case, authors can be the context.
Is there any difference in the research
topics published by authors in the USA and
those outside?
Now in this case,
the context would include the authors and
their affiliation and location.
So this goes beyond just
the author himself or herself.
We need to look at the additional
information connected to the author.
Is there any difference in the opinions
about a topic expressed on
one social network and another?
In this case, the social network of
authors and the topic can be the context.
Are there topics in news data that
are correlated with sudden changes in
stock prices?
In this case, we can use a time series
such as stock prices as context.
What issues mattered in the 2012
presidential campaign, or
presidential election?
Now in this case,
time serves again as context.
So, as you can see,
the list can go on and on.
Basically, contextual text mining
can have many applications.
[MUSIC]

[MUSIC]
This lecture is about
a specific technique for
Contextual Text Mining called Contextual
Probabilistic Latent Semantic Analysis.
In this lecture, we're going to continue
discussing Contextual Text Mining.
And we're going to introduce Contextual
Probabilistic Latent Semantic Analysis
as an extension of PLSA for
doing contextual text mining.
Recall that in contextual text mining
we hope to analyze topics in text
in consideration of the context, so
that we can associate the topics with
properties of the context that interest us.
So in this approach, contextual
probabilistic latent semantic analysis,
or CPLSA, the main idea is to
explicitly add interesting
context variables into a generative model.
Recall that before, when we generate
the text, we generally assume we'll start
with some topics, and
then sample words from some topics.
But here, we're going to add context
variables, so that the coverage of topics,
and also the content of topics,
would be tied to context.
Or in other words, we're going to let
the context influence both the coverage and
the content of a topic.
The consequence is that this will enable
us to discover contextualized topics,
making the topics more interesting,
more meaningful.
Because we can then have topics
that can be interpreted as
specific to a particular
context that we are interested in,
for example, a particular time period.
As an extension of the PLSA model,
CPLSA makes the following changes.
Firstly, it models the conditional
likelihood of text given context.
That clearly suggests that the generation
of text would then depend on context,
and that allows us to bring
context into the generative model.
Secondly, it makes two specific
assumptions about the dependency
of topics on context.
One is to assume that, depending on
the context, such as different time
periods or different locations,
there are different views of a topic,
or different versions of word
distributions that characterize a topic.
And this assumption allows
us to discover different
variations of the same topic
in different contexts.
The other is that we assume the topic
coverage also depends on the context.
That means depending on the time or
location, we might cover
topics differently.
Again, this dependency
would then allow us to
capture the association of
topics with specific contexts.
We can still use the EM algorithm to solve
the problem of parameter estimation.
And in this case, the estimated parameters
would naturally contain context variables.
And in particular, there are
a lot of conditional probabilities
of topics given certain contexts.
And this is what allows you
to do contextual text mining.
So this is the basic idea.
Now, we don't have time to
introduce this model in detail,
but there are references here that you
can look into for more detail.
Here I just want to explain the
high-level ideas in more detail.
In particular, I want to explain
the generation process
of text data that has context
associated in such a model.
So as you see here, we can assume
there are still multiple topics.
For example, some topics might represent
themes like government response,
donation, or the city of New Orleans.
Now this example is in the context
of Hurricane Katrina,
which hit New Orleans.
Now as you can see we
assume there are different
views associated with each of the topics.
And these are shown as View 1,
View 2, View 3.
Each view is a different
version of word distributions.
And these views are tied
to some context variables.
For example, tied to the location Texas,
or the time July 2005,
or the occupation of the author
being a sociologist.
Now, on the right side, we assume
the document has context information.
So the time is known to be July 2005.
The location is Texas, etc.
And such context information is
what we hope to model as well.
So we're not going to just model the text.
And so one idea here is to model
the variations of topic content in
various contexts.
And this gives us different views
of the word distributions.
Now on the bottom you will see that the theme
coverage, or topic coverage, might also vary
according to these contexts,
because in the case
of a location like Texas, people might
want to cover the red topic more.
That's New Orleans.
That's visualized here.
But in a certain time period,
maybe a particular topic
will be covered more.
So this variation is
also considered in CPLSA.
So to generate such a document with
context, we first choose a view.
And this view of course now could
be from any of these contexts.
Let's say we have taken this
view that depends on the time,
in the middle.
So now, we will have a specific
version of word distributions.
Now, you can see some probabilities
of words for each topic.
Now, once we have chosen a view,
the situation will be very similar
to what happened in standard PLSA.
We assume we have got a word distribution
associated with each topic, right?
And then next, we will also choose
a coverage from the bottom, so
we're going to choose a particular
coverage, and that coverage,
before, was fixed in PLSA and
assigned to a particular document.
Each document has just one
coverage distribution.
Now here, because we consider context,
the distribution of topics, or the coverage
of topics, can vary depending on the
context that has influenced the coverage.
So, for example,
we might pick a particular coverage.
Let's say in this case we picked
a document specific coverage.
Now with the coverage and
these word distributions
we can generate a document in
exactly the same way as in PLSA.
So what it means, we're going to
use the coverage to choose a topic,
to choose one of these three topics.
Let's say we have picked the yellow topic.
Then we'll draw a word from this
particular topic on the top.
Okay, so
we might get a word like government.
And then next time we might
choose a different topic, and
we'll get donate, etc.
Until we generate all the words.
And this is basically
the same process as in PLSA.
So the main difference is
when we obtain the coverage.
And the word distribution,
we let the context influence our choice So
in other words we have extra switches
that are tied to these contacts that will
control the choices of different views
of topics and the choices of coverage.
And naturally the model we have
more parameters to estimate.
But once we can estimate those
parameters that involve the context,
then we will be able to understand
the context specific views of topics,
or context specific coverages of topics.
And this is precisely what we
want in contextual text mining.
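The generation process just described can be sketched in a few lines of Python.
This is only a minimal illustration under stated assumptions, not the model from the references:
the topic names, context labels, and all probabilities below are invented, and the view and
the coverage are chosen once per document, as narrated above (the full model in the
references may make these choices per word).

```python
import random

# Hypothetical views: one set of topic word distributions per context value.
VIEWS = {
    "time=2005-07": {
        "government": {"government": 0.5, "response": 0.5},
        "donation": {"donate": 0.6, "relief": 0.4},
        "new_orleans": {"city": 0.5, "flood": 0.5},
    },
    "location=Texas": {
        "government": {"government": 0.4, "shelter": 0.6},
        "donation": {"donate": 0.5, "help": 0.5},
        "new_orleans": {"evacuee": 0.7, "city": 0.3},
    },
}

# Hypothetical coverages: one per context value plus a document-specific one.
COVERAGES = {
    "time=2005-07": {"government": 0.5, "donation": 0.3, "new_orleans": 0.2},
    "location=Texas": {"government": 0.2, "donation": 0.2, "new_orleans": 0.6},
    "document": {"government": 0.4, "donation": 0.4, "new_orleans": 0.2},
}


def draw(dist, rng):
    """Sample one key from a dict that maps outcomes to probabilities."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]


def generate_document(view_probs, coverage_probs, length, rng):
    """Generate words after context-dependent switches pick a view and a coverage."""
    view = VIEWS[draw(view_probs, rng)]              # switch 1: which view of the topics
    coverage = COVERAGES[draw(coverage_probs, rng)]  # switch 2: which topic coverage
    words = []
    for _ in range(length):
        topic = draw(coverage, rng)           # from here on, the same as PLSA
        words.append(draw(view[topic], rng))
    return words


rng = random.Random(0)
doc = generate_document(
    view_probs={"time=2005-07": 0.5, "location=Texas": 0.5},
    coverage_probs={"time=2005-07": 0.3, "location=Texas": 0.3, "document": 0.4},
    length=8,
    rng=rng,
)
print(doc)
```

The two `draw` calls at the top of `generate_document` play the role of the extra switches;
setting `view_probs` and `coverage_probs` to put all their mass on a single choice reduces
the sketch to ordinary PLSA-style generation.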
So here are some sample results
from using such a model,
not necessarily exactly the same model,
but similar models.
So on this slide you see
some sample results of
comparing news articles about Iraq War and
Afghanistan War.
Now we have about 30 articles on the Iraq
War and 26 articles on the Afghanistan War.
And in this case,
the goal is to reveal the common topics
covered in both sets of articles and
the differences or variations of
the topic in each of the two collections.
So in this case the context is explicitly
specified by the topic or collection.
And we see the results here
show that there is a common
theme that's corresponding to
Cluster 1 here in this column.
And there is a common theme indicating that
the United Nations is involved in both wars.
It's a common topic covered
in both sets of articles.
And that's indicated by the high
probability words shown here, united and
nations.
Now if you know the background,
of course this is not surprising and
this topic is indeed very
relevant to both wars.
If you look at the column further,
then what's interesting is that the next
two cells of word
distributions actually tell us
collection-specific variations
of the topic of United Nations.
So it indicates that in the Iraq War, the
United Nations was more involved
in weapons inspections, whereas in
the Afghanistan War it was more involved
in maybe aid to the Northern Alliance.
It's a different variation of
the topic of United Nations.
So this shows that by
bringing in the context,
in this case the different wars or
the different collections of text,
we can have topical variations
tied to these contexts,
to reveal the differences of coverage
of the United Nations in the two wars.
Now similarly if you look at
the second cluster, Cluster 2,
it has to do with the killing of people,
and, again,
it's not surprising if you know
the background about wars.
All the wars involve killing of people,
but
imagine if you are not familiar
with the text collections.
We have a lot of text articles, and
such a technique can reveal the common
topics covered in both sets of articles.
It can be used to reveal common topics
in multiple sets of articles as well.
If you look, of course, in
that column of Cluster 2,
you see variations of killing of people,
and that corresponds to different contexts.
And here is another example of results
obtained from blog articles
about Hurricane Katrina.
In this case,
what you see here is visualization of
the trends of topics over time.
And the top one shows just
the temporal trends of two topics.
One is oil price, and one is about
the flooding of the city of New Orleans.
Now these topics are obtained from
blog articles about Hurricane Katrina.
And people talk about these topics,
in addition to some other topics.
But the visualization shows
that with this technique,
we can have a conditional
distribution of time
given a topic.
So this allows us to plot
this conditional probability;
the curves are like what you're seeing here.
We see that, initially, the two
curves tracked each other very well.
But later we see the topic of New Orleans
was mentioned again but oil price was not.
And this turns out to be
the time period when another hurricane,
hurricane Rita hit the region.
And that apparently triggered more
discussion about the flooding of the city.
The bottom curve shows
the coverage of this topic
about flooding of the city by blog
articles in different locations.
And it also shows some shift of
coverage that might be related to
people's migrating from the state
of Louisiana to Texas for example.
So in this case we can see that time can
be used as context to reveal trends of
topics.
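As a rough illustration of the conditional distribution of time given a topic,
here is a small Python sketch. The lecture does not say how this distribution is computed,
so the use of Bayes' rule on a per-week topic coverage is an assumption, and the week labels,
the uniform prior over weeks, and all numbers are made up.

```python
def time_given_topic(coverage_by_time, time_prior, topic):
    """Turn p(topic | time) and p(time) into p(time | topic) with Bayes' rule."""
    joint = {t: coverage_by_time[t][topic] * time_prior[t] for t in coverage_by_time}
    total = sum(joint.values())
    return {t: value / total for t, value in joint.items()}


# Hypothetical weekly coverage of two topics (other topics take the remaining mass).
coverage_by_time = {
    "week1": {"oil_price": 0.40, "flooding": 0.50},
    "week2": {"oil_price": 0.30, "flooding": 0.30},
    "week3": {"oil_price": 0.10, "flooding": 0.10},
    "week4": {"oil_price": 0.05, "flooding": 0.30},
}
time_prior = {week: 0.25 for week in coverage_by_time}  # assumed: weeks equally likely

flooding_curve = time_given_topic(coverage_by_time, time_prior, "flooding")
oil_curve = time_given_topic(coverage_by_time, time_prior, "oil_price")
for week in coverage_by_time:
    print(week, round(flooding_curve[week], 3), round(oil_curve[week], 3))
```

Plotting the two resulting curves over the weeks would give the kind of trend picture
described above, where one topic comes back in a later week and the other does not.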
These are some additional
results on spatial patterns.
In this case it was about
the topic of government response.
And there was some criticism about
the slow response of government
in the case of Hurricane Katrina.
And the discussion now is
covered in different locations.
And these visualizations show the coverage
in different weeks of the event.
And initially it's covered
mostly in the victim states,
in the South, but then gradually
spread into other locations.
But in week four,
which is shown on the bottom left,
we see a pattern that's very similar
to the first week on the top left.
And that's when again
Hurricane Rita hit in the region.
So such a technique would allow
us to use location as context
to examine variations of topics.
And of course the model
is completely general, so
you can apply this to any
other collections of text
to reveal spatiotemporal patterns.
Here is yet another application
of this kind of model,
where we look at the use of the model for
event impact analysis.
So here we're looking at the research
articles in information retrieval,
IR, particularly SIGIR papers.
And the topic we are focusing on
is about the retrieval models.
And you can see the top words with high
probability about this model on the left.
And then we hope to examine
the impact of two events.
One is the start of TREC, the
Text REtrieval Conference.
This is a major evaluation
sponsored by the U.S.
government, and was launched in 1992 or
around that time.
And that is known to have made an impact on
the topics of research in
information retrieval.
The other is the publication of
a seminal paper by Croft and Ponte.
This is about a language model
approach to information retrieval.
It's also known to have made a high
impact on information retrieval research.
So we hope to use this kind of
model to understand impact.
The idea here is simply to
use the time as context.
And use these events to divide
the time into a period before
the event and
another after the event.
And then we can compare
the differences of the topics
and their variations, etc.
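A minimal Python sketch of how an event turns time into a two-valued context is shown here.
The function name, the example texts, and the choice to count the event year itself as
"after" are all assumptions for illustration; only the years 1992 and 1998 come from the lecture.

```python
from collections import defaultdict


def split_by_event(docs, event_year):
    """Group (year, text) pairs into 'before' and 'after' an event year."""
    groups = defaultdict(list)
    for year, text in docs:
        # Assumption: documents from the event year itself count as "after".
        label = "before" if year < event_year else "after"
        groups[label].append(text)
    return dict(groups)


# Hypothetical stand-ins for paper texts.
papers = [
    (1988, "vector space term weighting"),
    (1995, "evaluation on a large test collection"),
    (2001, "language model parameter estimation"),
]

by_trec = split_by_event(papers, event_year=1992)      # start of TREC
by_lm_paper = split_by_event(papers, event_year=1998)  # language model paper
print(by_trec)
print(by_lm_paper)
```

Each label can then be used as the context value attached to a document, so that
a topic gets one view for the "before" period and another for the "after" period.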
So in this case,
the results show that before TREC the study of
retrieval models was mostly a vector
space model, Boolean model, etc.
But after TREC,
apparently the study of retrieval models
has involved a lot of other words
that seem to suggest some
different retrieval tasks. So, for
example, email was used in
the enterprise search tasks, and
subtopical retrieval was another
task later introduced by TREC.
On the bottom,
we see the variations that are correlated
with the publication of
the language model paper.
Before, we have those classic
probabilistic models,
logic models, Boolean, etc., but after 1998,
we see clear dominance of language
models as probabilistic models.
And we see words like language model,
estimation of parameters, etc.
So this technique here can use events as
context to understand the impact of an event.
Again the technique is general, so
you can use this to analyze
the impact of any event.
Here are some suggested readings.
The first is a paper about a simple extension of
PLSA to enable cross-collection comparison.
It's to perform comparative
text mining to allow us to
extract common topics shared
by multiple collections
and their variations
in each collection.
The second one is the main
paper about the CPLSA model,
with a discussion of a lot of applications.
The third one has a lot of details
about the spatiotemporal patterns for
the Hurricane Katrina example.
[MUSIC]

[SOUND] This lecture is about
how to mine text data with
social network as context.
In this lecture we're going to continue
discussing contextual text mining.
In particular, we're going to look at
the social network of authors as context.
So first, what's our motivation for using
network context for analysis of text?
The context of a text
article can form a network.
For example the authors
of research articles
might form collaboration networks.
And authors of social media content
might form social networks.
For example,
on Twitter people might follow each other.
Or on Facebook people might
claim others as friends, etc.
So such context connects
the content of the authors.
Similarly, locations associated with
text can also be connected to form
a geographical network.
And in general you can imagine
the metadata of the text data
can form some kind of network
if they have some relations.
Now there is some benefit in
jointly analyzing text and
its social network context or
network context in general.
And that's because we can use network to
impose some constraints on topics of text.
So for example it's reasonable
to assume that authors
connected in collaboration networks
tend to write about the similar topics.
So such heuristics can be used
to guide us in analyzing topics.
Text also can help characterize the
content associated with each subnetwork.
And this is to say that both
kinds of data, the network and
text, can help each other.
So for example the difference in
opinions expressed in
two subnetworks can be revealed by
doing this type of joint analysis.
So here we briefly introduce a model
called a network-supervised topic model.
In this slide we're going to
give some general ideas.
And then in the next slide we're
going to give some more details.
But in general in this part of the course
we don't have enough time to cover
these frontier topics in detail.
But we provide references
that would allow you to
read more about the topic
to know the details.
But it should still be useful
to know the general ideas,
and to know what they can do, so you know
when you might be able to use them.
So the general idea of the network-supervised
topic model is the following.
Let's start with viewing
the regular topic models,
like PLSA and LDA, as
solving an optimization problem.
Of course, in this case,
the optimization objective
function is a likelihood function.
So we often use maximum likelihood
estimator to obtain the parameters.
And these parameters will give us
useful information that we want to
obtain from text data.
For example, topics.
So we want to maximize the probability
of text data given the parameters,
generally denoted by lambda.
The main idea of incorporating network is
to think about the constraints that
can be imposed based on the network.
In general,
the idea is to use the network to
impose some constraints on
the model parameters, lambda here.
For example,
the text at adjacent nodes of the network
can be assumed to cover similar topics.
Indeed, in many cases,
they tend to cover similar topics.
So we may be able to smooth
the topic distributions
on the graph on the network so
that adjacent nodes will have
very similar topic distributions.
So they will share a common
distribution on the topics.
Or have just slight variations of the
topic distributions, of the coverage.
So, technically, what we can do
is simply to add network-induced
regularizers to the likelihood
objective function as shown here.
So instead of just optimizing
the probability of text
data given parameters lambda, we're
going to optimize another function F.
This function combines the likelihood with
a regularizer function called R here.
And the regularizer is defined on
the parameters lambda and the network.
It tells us basically
what kind of parameters are preferred
from a network constraint perspective.
So you can easily see this is in effect
implementing the idea of imposing
some prior on the model parameters.
Only that we're not necessarily
having a probabilistic model, but
the idea is the same.
We're going to combine the two in
one single objective function.
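To make the idea of combining a likelihood with a network regularizer concrete,
here is a small Python sketch. It is only one possible combination: the additive form,
the trade-off named `weight` (called that to avoid confusion with the model parameters lambda),
and both toy functions are assumptions, not the specific function F on the slide.

```python
import math


def make_objective(log_likelihood, regularizer, weight):
    """Return F(params, network) = log_likelihood(params) + weight * regularizer(params, network)."""
    def objective(params, network):
        return log_likelihood(params) + weight * regularizer(params, network)
    return objective


def toy_log_likelihood(params):
    # Stand-in for a topic model likelihood: each node's parameter is the
    # probability it assigns to its own observed data.
    return sum(math.log(p) for p in params.values())


def toy_regularizer(params, network):
    # Prefers adjacent nodes to have similar parameters (higher is better).
    return -sum((params[u] - params[v]) ** 2 for u, v in network)


params = {"a": 0.9, "b": 0.5}
network = [("a", "b")]  # one edge between the two nodes

plain = make_objective(toy_log_likelihood, toy_regularizer, weight=0.0)
smoothed = make_objective(toy_log_likelihood, toy_regularizer, weight=2.0)
print(plain(params, network), smoothed(params, network))
```

Because the likelihood and the regularizer are passed in as plain functions,
either one can be swapped out without touching the other.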
So, the advantage of this idea
is that it's quite general.
Here the topic model can be any
generative model for text.
It doesn't have to be PLSA or
LDA, or the current topic models.
And similarly,
the network can also be any network,
any graph that connects
these text objects.
This regularizer can
also be any regularizer.
We can be flexible in capturing different
heuristics that we want to capture.
And finally,
the function F can also vary, so
there can be many different
ways to combine them.
So, this general idea is actually quite,
quite powerful.
It offers a general approach
to combining these different
types of data in a single
optimization framework.
And this general idea can really
be applied for any problem.
But here in the paper referenced here,
a particular instantiation
called NetPLSA was studied.
In this case, it's just
instantiating PLSA to incorporate this
simple constraint imposed by the network.
And the prior here is that the neighbors on
the network must have
similar topic distributions.
They must cover similar
topics in similar ways.
And that's basically
what it says in English.
So technically we just have
a modified objective function here.
It's defined on both the text collection C
and the network graph G here.
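The slide itself is not part of this transcript, so the formula below is a reconstruction
from the spoken description, not a copy of the slide. The symbols c(w, d) for word counts,
k for the number of topics, E for the edge set, w(u, v) for the edge weight, and the
exact (1 - lambda) and 1/2 factors are assumptions.

```latex
O(C, G) = (1 - \lambda) \sum_{d \in C} \sum_{w} c(w, d)
          \log \sum_{j=1}^{k} p(\theta_j \mid d)\, p(w \mid \theta_j)
        \;-\; \frac{\lambda}{2} \sum_{(u, v) \in E} w(u, v)
          \sum_{j=1}^{k} \bigl( p(\theta_j \mid u) - p(\theta_j \mid v) \bigr)^2
```

The first sum is the PLSA log-likelihood, and the second sum, with the negative sign
in front, penalizes differences in topic coverage between adjacent nodes u and v.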
And if you look at this formula,
you can actually recognize
some parts,
because they should be
fairly familiar to you by now.
So can you recognize which
part is the likelihood for
the text given the topic model?
Well if you look at it, you will see this
part is precisely the PLSA log-likelihood
that we want to maximize when we
estimate parameters for PLSA alone.
But the second part shows some
additional constraints on the parameters.
And in particular,
we'll see here it's to measure
the difference between the topic
coverage at node u and node v,
the two adjacent nodes on the network.
We want their distributions to be similar.
So here we are computing the square
of their differences and
we want to minimize this difference.
And note that there's a negative sign in
front of this sum, this whole sum here.
So this makes it possible to find
the parameters that both
maximize the PLSA log-likelihood,
which means the parameters
will fit the data well, and
also respect this
constraint from the network.
And this is the negative
sign that I just mentioned.
Because this is a negative sign,
when we maximize this
objective function we'll actually
minimize this second term here.
So if we look further in
this formula we'll see
there is also a weight of the
edge between u and v here.
And that is based on our network.
If you have a weight that says, well,
these two nodes are strong
collaborators as researchers,
or these two are strongly connected
people in a social network,
then they would have a high weight.
Then that means it would be more important
that their topic coverages are similar.
And that's basically what it says here.
And finally you see
a parameter lambda here.
This is a new parameter to control
the influence of network constraint.
We can see easily, if lambda is set to 0,
we just go back to the standard PLSA.
But when lambda is set to a larger value,
then we will let the network
influence the estimated models more.
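The objective just walked through can be evaluated on toy data with a short Python sketch.
This is an illustration under assumptions, not the implementation from the paper: the nodes
are taken to be documents, the (1 - lam) and 0.5 factors follow one common way of writing
the objective, and all counts, distributions, and edge weights are invented.
It only evaluates the objective; it does not estimate parameters.

```python
import math


def plsa_log_likelihood(counts, coverage, topics):
    """Sum over documents and words of c(w, d) * log sum_j p(theta_j | d) p(w | theta_j)."""
    total = 0.0
    for doc, word_counts in counts.items():
        for word, count in word_counts.items():
            # Toy data is chosen so that every observed word has nonzero probability.
            p_word = sum(coverage[doc][j] * topics[j].get(word, 0.0)
                         for j in range(len(topics)))
            total += count * math.log(p_word)
    return total


def network_penalty(edges, coverage):
    """Half the weighted sum of squared topic coverage differences over all edges."""
    return 0.5 * sum(
        weight * sum((pu - pv) ** 2 for pu, pv in zip(coverage[u], coverage[v]))
        for u, v, weight in edges
    )


def netplsa_objective(counts, coverage, topics, edges, lam):
    """Likelihood term minus the network penalty, traded off by lam in [0, 1]."""
    return ((1.0 - lam) * plsa_log_likelihood(counts, coverage, topics)
            - lam * network_penalty(edges, coverage))


# Hypothetical toy data: two topics, two documents, one weighted edge.
topics = [{"query": 0.5, "retrieval": 0.5}, {"gene": 0.5, "protein": 0.5}]
counts = {"d1": {"query": 2, "gene": 1}, "d2": {"retrieval": 1, "protein": 2}}
coverage = {"d1": [0.7, 0.3], "d2": [0.4, 0.6]}
edges = [("d1", "d2", 1.0)]

for lam in (0.0, 0.5, 1.0):
    print(lam, netplsa_objective(counts, coverage, topics, edges, lam))
```

With `lam=0.0` the value equals the plain PLSA log-likelihood, and larger values of `lam`
give the edge penalty more say, which matches the role of lambda described above.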
So as you can see, the effect here is
that we're going to do basically PLSA.
But we're going to also try
to make the topic coverages
on the two nodes that are s