We
introduce
a
technique
for
identifying
the
most
salient
participants
in
a
discussion
.
Our
method
,
MavenRank
is
based
on
lexical
centrality
:
a
random
walk
is
performed
on
a
graph
in
which
each
node
is
a
participant
in
the
discussion
and
an
edge
links
two
participants
who
use
similar
rhetoric
.
As
a
test
,
we
used
MavenRank
to
identify
the
most
influential
members
of
the
US
Senate
using
data
from
the
US
Congressional
Record
and
used
committee
ranking
to
evaluate
the
output
.
Our
results
show
that
MavenRank
scores
are
largely
driven
by
committee
status
in
most
topics
,
but
can
capture
speaker
centrality
in
topics
where
speeches
are
used
to
indicate
ideological
position
instead
of
influence
legislation
.
1
Introduction
In
a
conversation
or
debate
between
a
group
of
people
,
we
can
think
of
two
remarks
as
interacting
if
they
are
both
comments
on
the
same
topic
.
For
example
,
if
one
speaker
says
"
taxes
should
be
lowered
to
help
business
,
"
while
another
argues
"
taxes
should
be
raised
to
support
our
schools
,
"
the
speeches
are
interacting
with
each
other
by
describing
the
same
issue
.
In
a
debate
with
many
people
arguing
about
many
different
things
,
we
could
imagine
a
large
network
of
speeches
interacting
with
each
other
in
the
same
way
.
If
we
associate
each
speech
in
the
network
with
its
speaker
,
we
can
try
to
identify
the
most
important
people
in
the
debate
based
on
how
central
their
speeches
are
in
the
network
.
To
describe
this
type
of
centrality
,
we
borrow
a
term
from
The
Tipping
Point
(
Gladwell
,
2002
)
,
in
which
Gladwell
describes
a
certain
type
of
personality
in
a
social
network
called
a
maven
.
A
maven
is
a
trusted
expert
in
a
specific
field
who
influences
other
people
by
passing
information
and
advice
.
In
this
paper
,
our
goal
is
to
identify
authoritative
speakers
who
control
the
spread
of
ideas
within
a
topic
.
To
do
this
,
we
introduce
MavenRank
,
which
measures
the
centrality
of
speeches
as
nodes
in
the
type
of
network
described
in
the
previous
paragraph
.
Significant
research
has
been
done
in
the
area
of
identifying
central
nodes
in
a
network
.
Various
methods
exist
for
measuring
centrality
,
including
degree
centrality
,
closeness
,
betweenness
(
Freeman
,
1977
;
Newman
,
2003
)
,
and
eigenvector
centrality
.
Eigenvector
centrality
in
particular
has
been
successfully
applied
to
many
different
types
of
networks
,
including
hyperlinked
web
pages
(
Brin
and
Page
,
1998
;
Kleinberg
,
1998
)
,
lexical
networks
(
Erkan
and
Radev
,
2004
;
Mihalcea
and
Ta-rau
,
2004
;
Kurland
and
Lee
,
2005
;
Kurland
and
Lee
,
2006
)
,
and
semantic
networks
(
Mihalcea
et
al.
,
2004
)
.
The
authors
of
(
Lin
and
Kan
,
2007
)
extended
these
methods
to
include
timestamped
graphs
where
nodes
are
added
over
time
and
applied
it
to
multi-document
summarization
.
In
(
Tong
and
Faloutsos
,
2006
)
,
the
authors
use
random
walks
on
a
graph
as
a
method
for
finding
a
subgraph
that
best
connects
some
or
all
of
a
set
of
query
nodes
.
In
our
paper
,
we
introduce
a
new
application
of
eigenvector
cen-trality
for
identifying
the
central
speakers
in
the
type
of
debate
or
conversation
network
described
above
.
Our
method
is
based
on
the
one
described
in
(
Erkan
Proceedings
of
the
2007
Joint
Conference
on
Empirical
Methods
in
Natural
Language
Processing
and
Computational
Natural
Language
Learning
,
pp.
658-666
,
Prague
,
June
2007
.
©
2007
Association
for
Computational
Linguistics
but
modified
to
rank
speakers
instead
of
documents
or
sentences
.
In
our
paper
,
we
apply
our
method
to
analyze
the
US
Congressional
Record
,
which
is
a
verbatim
transcript
of
speeches
given
in
the
United
States
House
of
Representatives
and
Senate
.
The
Record
is
a
dense
corpus
of
speeches
made
by
a
large
number
of
people
over
a
long
period
of
time
.
Using
the
transcripts
of
political
speeches
adds
an
extra
layer
of
meaning
onto
the
measure
of
speaker
centrality
.
The
centrality
of
speakers
in
Congress
can
be
thought
of
as
a
measure
of
relative
importance
or
influence
in
the
US
legislative
process
.
We
can
also
use
speaker
centrality
to
analyze
committee
membership
:
are
the
central
speakers
on
a
given
issue
ranking
members
of
a
related
committee
?
Is
there
a
type
of
importance
captured
through
speaker
centrality
that
isn
't
obvious
in
the
natural
committee
rankings
?
There
has
been
growing
interest
in
using
techniques
from
natural
language
processing
in
the
area
of
political
science
.
In
(
Porter
et
al.
,
2005
)
the
authors
performed
a
network
analysis
of
members
and
committees
of
the
US
House
of
Representatives
.
They
found
connections
between
certain
committees
and
political
positions
that
suggest
that
committee
membership
is
not
determined
at
random
.
In
(
Thomas
et
al.
,
2006
)
,
the
authors
use
the
transcripts
of
debates
from
the
US
Congress
to
automatically
classify
speeches
as
supporting
or
opposing
a
given
topic
by
taking
advantage
of
the
voting
records
of
the
speakers
.
In
(
Wang
et
al.
,
2005
)
,
the
authors
use
a
generative
model
to
simultaneously
discover
groups
of
voters
and
topics
using
the
voting
records
and
the
text
from
bills
of
the
US
Senate
and
the
United
Nations
.
The
authors
of
(
Quinn
et
al.
,
2006
)
introduce
a
multinomial
mixture
model
to
perform
unsupervised
clustering
of
Congressional
speech
documents
into
topically
related
categories
.
We
rely
on
the
output
of
this
model
to
cluster
the
speeches
from
the
Record
in
order
to
compare
speaker
rankings
within
a
topic
to
related
committees
.
We
take
advantage
of
the
natural
measures
of
prestige
in
Senate
committees
and
use
them
as
a
standard
for
comparison
with
MavenRank
.
Our
hypothesis
is
that
MavenRank
centrality
will
capture
the
importance
of
speakers
based
on
the
natural
committee
rankings
and
seniority
.
We
can
test
this
claim
by
clustering
speeches
into
topics
and
then
mapping
the
topics
to
related
committees
.
If
the
hypothesis
is
correct
,
then
the
speaker
centrality
should
be
correlated
with
the
natural
committee
rankings
.
There
have
been
other
attempts
to
link
floor
participation
with
topics
in
political
science
.
In
(
Hall
,
1996
)
,
the
author
found
that
serving
on
a
committee
can
positively
predict
participation
in
Congress
,
but
that
seniority
was
not
a
good
predictor
.
His
measure
only
looked
at
six
bills
in
three
committees
,
so
his
method
is
by
far
not
as
comprehensive
as
the
one
that
we
present
here
.
Our
approach
with
MavenRank
differs
from
previous
work
by
providing
a
large
scale
analysis
of
speaker
centrality
and
bringing
natural
language
processing
techniques
to
the
realm
of
political
science
.
2.1
The
US
Congressional
Speech
Corpus
The
text
used
in
the
experiments
is
from
the
United
States
Congressional
Speech
corpus
(
Monroe
et
al.
,
2006
)
,
which
is
an
XML
formatted
version
of
the
electronic
United
States
Congressional
Record
from
the
Library
of
Congress1
.
The
Congressional
Record
is
a
verbatim
transcript
of
the
speeches
made
in
the
US
House
of
Representatives
and
Senate
beginning
with
the
101st
Congress
in
1998
and
includes
tens
of
thousands
of
speeches
per
year
.
In
our
experiments
we
focused
on
the
records
from
the
105th
and
106th
Senates
.
The
basic
unit
of
the
US
Congressional
Speech
corpus
is
a
record
,
which
corresponds
to
a
single
subsection
of
the
print
version
of
the
Congressional
Record
and
may
contain
zero
or
more
speakers
.
Each
paragraph
of
text
within
a
record
is
tagged
as
either
speech
or
non-speech
and
each
paragraph
of
speech
text
is
tagged
with
the
unique
id
of
the
speaker
.
Figure
1
shows
an
example
record
file
for
the
sixth
record
on
July
14th
,
1997
in
the
105th
Senate
.
In
our
experiments
we
use
a
smaller
unit
of
analysis
called
a
speech
document
by
taking
all
of
the
text
of
a
speaker
within
a
single
record
.
The
capitalization
and
punctuation
is
then
removed
from
the
text
as
in
(
Monroe
et
al.
,
2006
)
and
then
the
text
stemmed
using
Porter
's
Snowball
II
stemmer2
.
Figure
1
shows
an
example
speech
document
for
speaker
15703
(
Herb
Kohl
of
Wisconsin
)
that
has
been
generated
from
the
record
in
Figure
1
.
In
addition
to
speech
documents
,
we
also
use
speaker
documents
.
A
speaker
document
is
the
concatenation
of
all
of
a
speaker
's
speech
documents
within
a
single
session
and
topic
(
so
a
single
speaker
may
have
multiple
speaker
documents
across
topics
)
.
For
example
within
the
105th
Senate
in
topic
1
(
"
Judicial
Nominations
"
)
,
Senator
Kohl
has
four
speech
documents
,
so
the
speaker
document
attributed
to
him
within
this
session
and
topic
would
be
the
text
of
these
four
documents
treated
as
a
single
unit
.
The
order
of
the
concatenation
does
not
matter
since
we
will
look
at
it
as
a
vector
of
weighted
term
frequencies
(
see
Section
3.2
)
.
We
used
the
direct
output
of
the
42-topic
model
of
the
105th-108th
Senates
from
(
Quinn
et
al.
,
2006
)
to
further
divide
the
speech
documents
into
topic
clusters
.
In
their
paper
,
they
use
a
model
where
the
probabilities
of
a
document
belonging
to
a
certain
topic
varies
smoothly
over
time
and
the
words
within
a
given
document
have
exactly
the
same
probability
of
being
drawn
from
a
particular
topic
.
These
two
properties
make
the
model
different
than
standard
mixture
models
(
McLachlan
and
Peel
,
2000
)
and
the
latent
Dirichlet
allocation
model
of
(
Blei
et
al.
,
2003
)
.
The
model
of
(
Quinn
et
al.
,
2006
)
is
most
closely
related
to
the
model
of
(
Blei
and
Lafferty
,
2006
)
,
who
present
a
generalization
of
the
model
topics
and
their
related
committees
.
The
output
from
the
topic
model
is
a
D
x
42
matrix
Z
where
D
is
the
number
of
speech
documents
and
the
element
zdk
represents
the
probability
of
the
dth
speech
document
being
generated
by
topic
k.
We
clustered
the
speech
documents
by
assigning
a
speech
document
d
to
the
kth
cluster
where
If
the
maximum
value
is
not
unique
,
we
arbitrarily
assign
d
to
the
lowest
numbered
cluster
where
zdj
is
2http
:
/
/
snowball.tartarus.org
/
algorithms
/
english
/
stemmer.html
a
maximum
.
A
typical
topic
cluster
contains
several
hundred
speech
documents
,
while
some
of
the
larger
topic
clusters
contain
several
thousand
.
2.3
Committee
Membership
Information
The
committee
membership
information
that
we
used
in
the
experiments
is
from
Stewart
and
Woon
's
committee
assignment
codebook
(
Stewart
and
Woon
,
2005
)
.
This
provided
us
with
a
roster
for
each
committee
and
rank
and
seniority
information
for
each
member
.
In
our
experiments
we
use
the
rank
within
party
and
committee
seniority
member
attributes
to
test
the
output
of
our
pipeline
.
The
rank
within
party
attribute
orders
the
members
of
a
committee
based
on
the
Resolution
that
appointed
the
members
with
the
highest
ranking
members
having
the
lowest
number
.
The
chair
and
ranking
members
always
receive
a
rank
of
1
within
their
party
.
A
committee
member
's
committee
seniority
attribute
corresponds
to
the
number
of
years
that
the
member
has
served
on
the
given
committee
.
2.4
Mapping
Topics
to
Committees
In
order
to
test
our
hypothesis
that
lexical
centrality
is
correlated
with
the
natural
committee
rankings
,
we
needed
a
map
from
topics
to
related
committees
.
We
based
our
mapping
on
Senate
Rule
XXV,3
which
defines
the
committees
,
and
the
descriptions
on
committee
home
pages
.
Table
1
shows
the
map
,
where
a
topic
's
related
committees
are
listed
in
italics
below
the
topic
name
.
Because
we
are
matching
short
topic
names
to
the
complex
descriptions
given
by
Rule
XXV
,
the
topic-committee
map
is
not
one
to
one
or
even
particularly
well
defined
:
some
topics
are
mapped
to
multiple
committees
,
some
topics
are
not
mapped
to
any
committees
,
and
two
different
topics
may
be
mapped
to
the
same
committee
.
This
is
not
a
major
problem
because
even
if
a
one
to
one
map
between
topics
and
committees
existed
,
speakers
from
outside
a
topic
's
related
committee
are
free
to
participate
in
the
topic
simply
by
giving
a
speech
.
Therefore
there
is
no
way
to
rank
all
speakers
in
a
topic
using
committee
information
.
To
test
our
hypotheses
,
we
focused
our
attention
on
topics
that
have
at
least
one
related
committee
.
In
Section
4.3
we
describe
how
the
MavenRank
scores
mr
presid
a
the
rank
democrat
on
the
antitrust
subcommitte
let
me
tell
you
why
i
support
mr
klein
nomin
why
he
i
a
good
choic
for
the
job
and
why
we
ought
to
confirm
him
todai
first
joel
klein
i
an
accomplish
lawyer
with
a
distinguish
career
he
graduat
from
columbia
univers
and
harvard
law
school
and
clerk
for
the
u
court
of
appeal
here
in
washington
then
for
justic
powel
just
a
importantli
he
i
the
presid
choic
to
head
the
antitrust
divis
and
i
believ
that
ani
presid
democrat
or
republican
i
entitl
to
a
strong
presumpt
in
favor
of
hi
execut
branch
nomine
second
joel
klein
i
a
pragmatist
not
an
idealogu
hi
answer
at
hi
confirm
hear
suggest
that
he
i
not
antibusi
a
some
would
claim
the
antitrust
divis
wa
in
the
late
197
0
nor
anticonsum
a
some
argu
the
divis
wa
dure
the
1980
instead
he
will
plot
a
middl
cours
i
believ
that
promot
free
market
fair
competit
and
consum
welfar
the
third
reason
we
should
confirm
joel
klein
i
becaus
no
on
deserv
to
linger
in
thi
type
of
legisl
limbo
here
in
congress
we
need
the
input
of
a
confirm
head
of
the
antitrust
divis
to
give
u
the
administr
view
on
a
varieti
of
import
polici
matter
defens
consolid
electr
deregul
and
telecommun
merger
among
other
we
need
someon
who
can
speak
with
author
for
the
divis
without
a
cloud
hang
over
hi
head
more
than
that
without
a
confirm
leader
moral
at
the
antitrust
divis
i
suffer
and
given
the
pace
at
which
the
presid
ha
nomin
and
the
senat
ha
confirm
appointe
if
we
fail
to
approv
mr
klein
it
will
be
at
least
a
year
befor
we
confirm
a
replac
mayb
longer
and
mayb
never
so
we
need
to
act
now
we
can
't
afford
to
let
the
antitrust
divis
continu
to
drift
final
mr
presid
i
have
great
respect
for
the
senat
from
south
carolina
a
well
a
the
senat
from
nebraska
and
north
dakota
thei
have
been
forc
advoc
for
consum
on
telecommun
matter
and
Figure
1
:
A
sample
of
the
text
from
record
105.sen.19970714.006.xml
and
the
speech
document
for
Senator
Herb
Kohl
of
Wisconsin
(
id
15703
)
generated
from
it
.
The
"
.
.
.
"
represents
omitted
text
.
Judicial
Nominations
Procedural
1
(
Housekeeping
1
)
Procedural
2
(
Housekeeping
2
)
Veterans
'
Affairs
Rules
and
Administration
Armed
Forces
1
(
Manpower
)
Gordon
Smith
re
Hate
Crime
Debt
/
Deficit
/
Social
Security
Armed
Forces
2
(
Infrastructure
)
Appropriations
Symbolic
(
Tribute
-
Living
)
Symbolic
(
Congratulations
-
Sports
)
Aging
(
Special
Committee
)
Supreme
Court
/
Constitutional
Environment
2
(
Regulation
)
Defense
(
Use
of
Force
)
Commercial
Infrastructure
Environment
and
Public
Works
Armed
Services
Commerce
,
Science
,
and
Transportation
Symbolic
(
Remembrance
-
Military
)
International
Affairs
(
Diplomacy
)
Jesse
Helms
re
Debt
Environment
1
(
Public
Lands
)
Procedural
5
(
Housekeeping
3
)
Energy
and
Natural
Resources
Judiciary
Procedural
6
(
Housekeeping
4
)
Symbolic
(
Tribute
-
Constituent
)
Symbolic
(
Remembrance
-
Nonmilitary
)
International
Affairs
(
Arms
Control
)
Foreign
Relations
Social
Welfare
Intelligence
(
Select
Committee
)
Small
Business
and
Entrepreneurship
Agriculture
,
Nutrition
,
and
Forestry
Homeland
Security
and
Governmental
Affairs
Foreign
Trade
Banking
,
Housing
,
and
Urban
Affairs
Education
Health
,
Education
,
Labor
,
and
Pensions
Table
1
:
The
numbers
and
names
of
the
42
topics
from
(
Quinn
et
al.
,
2006
)
with
our
mappings
to
related
committees
(
listed
below
the
topic
name
,
if
available
)
.
&lt;
TITLE
&gt;
NOMINATION
OF
JOEL
KLEIN
TO
BE
ASSISTANT
ATTORNEY
GENERAL
IN
CHARGE
OF
THE
ANTITRUST
DIVISION
&lt;
/
TITLE
&gt;
&lt;
SPEAKER
&gt;
NULL
&lt;
/
SPEAKER
&gt;
&lt;
NONSPEECH
&gt;
NOMINATION
OF
JOEL
KLEIN
TO
BE
ASSISTANT
ATTORNEY
GENERAL
IN
CHARGE
OF
THE
ANTITRUST
DIVISION
&lt;
SPEECH
&gt;
Mr.
President
,
as
the
ranking
Democrat
on
the
Antitrust
Subcommittee
,
let
me
tell
you
why
I
support
Mr.
Klein
's
nomination
,
why
he
is
a
good
choice
for
the
job
,
and
why
we
ought
to
confirm
him
today
.
of
speakers
who
are
not
members
of
related
committees
were
taken
into
account
when
we
measured
the
rank
correlations
.
3
MavenRank
and
Lexical
Similarity
The
following
sections
describe
MavenRank
,
a
measure
of
speaker
centrality
,
and
tf-idf
cosine
similarity
,
which
is
used
to
measure
the
lexical
similarity
of
speeches
.
MavenRank
is
a
graph-based
method
for
finding
speaker
centrality
.
It
is
similar
to
the
methods
in
(
Erkan
and
Radev
,
2004
;
Mihalcea
and
Tarau
,
2004
;
Kurland
and
Lee
,
2005
)
,
which
can
be
used
for
ranking
sentences
in
extractive
summaries
and
documents
in
an
information
retrieval
system
.
Given
a
collection
of
speeches
s1
,
.
.
.
,
sN
and
a
measure
of
lexical
similarity
between
pairs
sim
(
si
,
Sj
)
&gt;
0
,
a
similarity
graph
can
be
constructed
.
The
nodes
of
the
graph
represent
the
speeches
and
a
weighted
similarity
edge
is
placed
between
pairs
that
exceed
a
similarity
threshold
smin
.
MavenRank
is
based
on
the
premise
that
important
speakers
will
have
central
speeches
in
the
graph
,
and
that
central
speeches
should
be
similar
to
other
central
speeches
.
A
recursive
explanation
of
this
concept
is
that
the
score
of
a
speech
should
be
proportional
to
the
scores
of
its
similar
neighbors
.
Given
a
speech
s
in
the
graph
,
we
can
express
the
recursive
definition
of
its
score
p
(
s
)
as
where
adj
[
s
]
is
the
set
of
all
speeches
adjacent
to
s
and
wdeg
(
t
)
=
u_adj
[
t
]
sim
(
t
,
u
)
,
the
weighted
degree
of
t.
Equation
(
1
)
captures
the
idea
that
the
MavenRank
score
of
a
speech
is
distributed
to
its
neighbors
.
We
can
rewrite
this
using
matrix
notation
as
where
S
(
i
,
j
)
=
sim
(
si
;
sj
)
.
Equation
(
2
)
shows
that
the
vector
of
MavenRank
scores
p
is
the
left
eigenvector
of
B
with
eigenvalue
1
.
We
can
prove
that
the
eigenvector
p
exists
by
using
a
techinque
from
(
Page
et
al.
,
1999
)
.
We
can
treat
the
matrix
B
as
a
Markov
chain
describing
the
transition
probabilities
of
a
random
walk
on
the
speech
similarity
graph
.
The
vector
p
then
represents
the
stationary
distribution
of
the
random
walk
.
It
is
possible
that
some
parts
of
the
graph
are
disconnected
or
that
the
walk
gets
trapped
in
a
component
.
These
problems
are
solved
by
reserving
a
small
escape
probability
at
each
node
that
represents
a
chance
of
jumping
to
any
node
in
the
graph
,
making
the
Markov
chain
irreducible
and
aperiodic
,
which
guarantees
the
existence
of
the
eigenvector
.
Assuming
a
uniform
escape
probability
for
each
node
on
the
graph
,
we
can
rewrite
Equation
(
2
)
as
where
U
is
a
square
matrix
with
U
(
i
,
j
)
=
1
/
N
for
all
i
and
j
,
N
is
the
number
of
nodes
,
and
d
is
the
escape
probability
chosen
in
the
interval
[
0.1
,
0.2
]
(
Brin
and
Page
,
1998
)
.
Equation
(
4
)
is
known
as
PageRank
(
Page
et
al.
,
1999
)
and
is
used
for
determining
prestige
on
the
web
in
the
Google
search
engine
.
3.2
Lexical
Similarity
In
our
experiments
,
we
used
tf-idf
cosine
similarity
to
measure
lexical
similarity
between
speech
documents
.
We
represent
each
speech
document
as
a
vector
of
term
frequencies
(
or
f
)
,
which
are
weighted
according
to
the
relative
importance
of
the
given
term
in
the
cluster
.
The
terms
are
weighted
by
their
inverse
document
frequency
or
idf
.
The
idf
of
a
term
w
is
given
by
(
Sparck-Jones
,
1972
)
where
N
is
the
number
of
documents
in
the
corpus
and
nw
is
the
number
of
documents
in
the
corpus
containing
the
term
w.
It
follows
that
very
common
words
like
"
of
"
or
"
the
"
have
a
very
low
idf
,
while
the
idf
values
of
rare
words
are
higher
.
In
our
experiments
,
we
calculated
the
idf
values
for
each
topic
using
all
speech
documents
across
sessions
within
the
Abortion
Child
Education
Workers
,
Protection
Retirement
Figure
2
:
MavenRank
percentiles
for
three
speakers
over
four
topics
.
given
topic
.
We
calculated
topic-specific
idf
values
because
some
words
may
be
relatively
unimportant
in
one
topic
,
but
important
in
another
.
For
example
,
in
topic
22
(
"
Abortion
"
)
,
the
idf
of
the
term
"
abort
"
is
near
0.20
,
while
in
topic
38
(
"
Taxes
"
)
,
its
idf
is
near
7.18
.
The
tf-idf
cosine
similarity
measure
tf-idf-cosine
(
u
,
v
)
is
defined
as
which
is
the
cosine
of
the
angle
between
the
tf-idf
vectors
.
There
are
other
alternatives
to
tf-idf
cosine
similarity
.
Some
other
possible
similarity
measures
are
document
edit
distance
,
the
language
models
from
(
Kurland
and
Lee
,
2005
)
,
or
generation
probabilities
from
(
Erkan
,
2006
)
.
For
simplicity
,
we
only
used
tf-idf
similarities
in
our
experiments
,
but
any
of
these
measures
could
be
used
in
this
case
.
We
used
the
topic
clusters
from
the
105th
Senate
as
training
data
to
adjust
the
parameter
smin
and
observe
trends
in
the
data
.
We
did
not
run
experiments
to
test
the
effect
of
different
values
of
smin
on
MavenRank
scores
,
but
our
chosen
value
of
0.25
has
shown
to
give
acceptable
results
in
similar
experiments
(
Erkan
and
Radev
,
2004
)
.
We
used
the
topic
clusters
from
the
106th
Senate
as
test
data
.
For
the
speech
document
networks
,
there
was
an
average
of
351
nodes
(
speech
documents
)
and
2142
edges
per
topic
.
For
the
speaker
document
networks
,
there
was
an
average
of
63
nodes
(
speakers
)
and
545
edges
per
topic
.
4.2
Experimental
Setup
We
set
up
a
pipeline
using
a
Perl
implementation
of
tf-idf
cosine
similarity
and
MavenRank
.
We
ran
MavenRank
on
the
topic
clusters
and
ranked
the
speakers
based
on
the
output
.
We
used
two
different
types
granularities
of
the
graphs
as
input
:
one
where
the
nodes
are
speech
documents
and
another
where
the
nodes
are
speaker
documents
(
see
Section
2.1
)
.
For
the
speech
document
graph
,
a
speaker
's
score
is
determined
by
the
sum
of
the
MavenRank
scores
of
the
speeches
given
by
that
speaker
.
4.3
Evaluation
Methods
To
evaluate
our
output
,
we
estimate
independent
ordinary
least
squares
linear
regression
models
of
MavenRank
centrality
for
topics
with
at
least
one
related
committee
(
there
are
29
total
)
:
where
i
indexes
Senators
,
k
indexes
topics
,
Senioriti
ik
is
the
number
of
years
Senator
i
has
served
on
the
relevant
committee
for
topic
k
(
value
zero
for
those
not
on
a
relevant
committee
)
and
RankingMemberjk
has
the
value
of
one
only
for
the
Chair
and
ranking
minority
member
of
a
relevant
committee
.
We
are
interested
primarily
in
the
overall
significance
of
the
estimated
model
(
indicating
committee
effects
)
and
,
secondarily
,
in
the
specific
source
of
any
committee
effect
in
seniority
or
committee
rank
.
Table
2
summarizes
the
results
.
"
Maven
"
status
on
most
topics
does
appear
to
be
driven
by
committee
status
,
as
expected
.
There
are
particularly
strong
effects
of
seniority
and
rank
in
topics
tied
to
the
Judiciary
,
Foreign
Relations
,
and
Armed
Services
committees
,
as
well
as
legislation-rich
areas
of
domestic
policy
.
Perhaps
of
greater
interest
are
the
topics
that
do
not
have
committee
effects
.
These
are
of
three
distinct
types
.
The
first
are
highly
politicized
topics
for
which
speeches
are
intended
not
to
influence
Seniority
and
Ranking
Status
Both
Significant
Seniority
and
Ranking
Status
Jointly
Significant
18
Constitutional
Environment
2
[
Regulation
]
Seniority
Significant
Banking
/
Finance
No
Significant
Effects
of
Committee
Status
Environment
1
[
Public
Lands
]
Abortion
Armed
Forces
2
[
Infrastructure
]
19
Commercial
Infrastructure
Agriculture
Debt
/
Social
Security
Intelligence
Ranking
Status
Significant
Campaign
Finance
Child
Protection
1
Judicial
Nominations
14
Social
Welfare
aF-test
for
joint
significance
of
committee
variables
.
bT-test
for
significance
of
committee
seniority
.
cT-test
for
significance
of
chair
or
ranking
member
status
.
Table
2
:
Significance
tests
for
ordinary
least
squares
(
OLS
)
linear
regressions
of
MavenRank
scores
(
Speech-documents
graph
)
on
committee
seniority
(
in
years
)
and
ranking
status
(
chair
or
ranking
member
)
,
106th
Senate
,
topic-by-topic
.
Results
for
the
speaker-documents
graph
are
similar
.
legislation
as
much
as
indicate
an
ideological
or
partisan
position
,
so
the
mavens
are
not
on
particular
committees
(
abortion
,
children
,
seniors
,
the
economy
)
.
The
second
are
"
distributive
politics
"
topics
where
many
Senators
speak
to
defend
state
or
regional
interests
,
so
debate
is
broadly
distributed
and
there
are
no
clear
mavens
(
agriculture
,
military
base
closures
,
public
lands
)
.
Third
are
topics
where
there
are
not
enough
speeches
for
clear
results
,
because
most
debate
occurred
after
1999-2000
(
post-9
/
11
intelligence
reform
,
McCain-Feingold
campaign
finance
reform
)
.
Alternative
models
,
using
measures
of
centrality
based
on
the
centroid
were
also
examined
.
Distance
to
centroid
provides
broadly
similar
results
as
MavenRank
,
with
several
marginal
significance
results
reversed
in
each
direction
.
Cosine
similarity
with
centroid
,
on
the
other
hand
,
appears
to
have
no
relationship
with
committee
structure
.
Figure
2
shows
the
MavenRank
percentiles
(
using
the
speech
document
network
)
for
Senators
Rick
Santorum
,
Barbara
Boxer
,
and
Edward
Kennedy
across
a
few
topics
in
the
106th
Senate
.
These
sample
scores
conform
to
the
expected
rankings
for
these
speakers
.
In
this
session
,
Santorum
was
the
sponsor
of
a
bill
to
ban
partial
birth
abortions
and
was
a
spokesman
for
Social
Security
reform
,
which
support
his
high
ranking
in
abortion
and
workers
/
retirement
.
Boxer
acted
as
the
lead
opposition
to
Santorum
's
abortion
bill
and
is
known
for
her
support
of
child
abuse
laws
.
Kennedy
was
ranking
member
of
the
Health
,
Education
,
Labor
,
and
Pensions
committee
and
the
Judiciary
committee
(
which
was
involved
with
the
abortion
bill
)
.
4.5
MavenRank
in
Other
Contexts
MavenRank
is
a
general
method
for
finding
central
speakers
in
a
discussion
and
can
be
applied
to
areas
outside
of
political
science
.
One
potential
application
would
be
analyzing
blog
posts
to
find
"
Maven
"
bloggers
by
treating
blogs
as
speakers
and
posts
as
speeches
.
Similarly
,
MavenRank
could
be
used
to
find
central
participants
in
a
newsgroup
,
a
forum
,
or
a
collection
of
email
conversations
.
5
Conclusion
We
have
presented
a
technique
for
identifying
lexically
central
speakers
using
a
graph
based
method
called
MavenRank
.
To
test
our
method
for
finding
central
speakers
,
we
analyzed
the
Congressional
Record
by
creating
a
map
from
the
clusters
of
speeches
to
Senate
committees
and
comparing
the
natural
ranking
committee
members
to
the
output
of
MavenRank
.
We
found
evidence
of
a
possible
relationship
between
the
lexical
centrality
and
committee
rank
of
a
speaker
by
ranking
the
speeches
using
MavenRank
and
computing
the
rank
correlation
with
the
natural
ordering
of
speakers
.
Some
specific
committees
disagreed
with
our
hypothesis
that
MavenRank
and
committee
position
are
correlated
,
which
we
propose
is
because
of
the
non-legislative
aspects
of
those
specific
committees
.
The
results
of
our
experiment
suggest
that
MavenRank
can
indeed
be
used
to
find
central
speakers
in
a
corpus
of
speeches
.
We
are
currently
working
on
applying
our
methods
to
the
US
House
of
Representatives
and
other
records
of
parliamentary
speech
from
the
United
Kingdom
and
Australia
.
We
have
also
developed
a
dynamic
version
of
MavenRank
that
takes
time
into
account
when
finding
lexical
centrality
and
plan
on
using
it
with
the
various
parliamentary
records
.
We
are
interested
in
dynamic
MavenRank
to
go
further
with
the
idea
of
tracking
how
ideas
get
propagated
through
a
network
of
debates
,
including
congressional
records
,
blogs
,
and
newsgroups
.
Acknowledgments
This
paper
is
based
upon
work
supported
by
the
National
Science
Foundation
under
Grant
No.
0527513
,
"
DHB
:
The
dynamics
of
Political
Representation
and
Political
Rhetoric
"
.
Any
opinions
,
findings
,
and
conclusions
or
recommendations
expressed
in
this
paper
are
those
of
the
authors
and
do
not
necessarily
reflect
the
views
of
the
National
Science
Foundation
.
