SYNTAXSUM: GIBBS SAMPLER FOR CONTENT-SYNTAX MODEL AND SUMBASICH SUMMARIZER

William M. Darling
(C) Copyright 2010

This is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License.

The GNU General Public License does not permit this software to be
redistributed in proprietary programs.

This software is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
USA

----------------------------------------------------------------------

Included files:

- README: this file

- prepare_data.py: Python script that creates data files for use with Gibbs sampler
- topics.py: Python script that allows simple analysis of the learned topics

- main.cpp: Main file for running Gibbs Sampling using the syntaxsum class
- syntaxsum.cpp: C++ class implementing Gibbs Sampling for the SyntaxSum model
- syntaxsum.h: Header file for syntaxsum class
- Makefile: Makefile script to build Gibbs Sampler

- sumbasic.py: Python script implementing SumBasic
- syntax_sumA.py: Python script implementing SBH


Use:

To use the summarization system in this package, the topic and syntax distributions must first be learned from the data. By default, the code is set up for the DUC 2006 dataset, which has 1250 files organized into 50 document sets of 25 documents each. It can be adapted to another dataset by changing the #define settings in main.cpp and the constants in the summarization Python scripts. The names are self-explanatory.

The summarization and data preparation Python scripts depend on the NLTK library, which is freely available at http://www.nltk.org.

TO LEARN THE TOPIC AND SYNTAX CLASS WORD DISTRIBUTIONS
------------------------------------------------------

The C++ Gibbs Sampler can be built by simply typing "make" -- no special libraries are required.

The dataset must be put into the proper format to be used with the Gibbs Sampler. Two files are required. The first, the word stream, is a single-line file in which every word in the corpus is represented by its index in a sequential vocabulary that starts at index=1, with indices separated by spaces. Each sentence is terminated by the special index 0. The second file, the document stream, is also a single-line file, but it records which document each word comes from. If there are 5 words in the first document, 3 in the second, and 4 in the third, then the document stream would be simply:

1 1 1 1 1 2 2 2 3 3 3 3
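The two streams described above can be sketched in a few lines of Python. This is only an illustration of the format, not the code in prepare_data.py: the function name build_streams and the nested-list input are hypothetical, and the sketch assumes (following the example above) that the document stream has one entry per word and none for the sentence terminators.

```python
def build_streams(documents):
    """Build word and document streams from tokenized text.

    documents: list of documents, each a list of sentences,
               each sentence a list of word strings.
    """
    vocab = {}        # word -> index, starting at 1 (0 is the sentence terminator)
    word_stream = []  # vocabulary indices, with 0 after every sentence
    doc_stream = []   # 1-based document number for every word (assumption: words only)
    for doc_id, sentences in enumerate(documents, start=1):
        for sentence in sentences:
            for word in sentence:
                index = vocab.setdefault(word, len(vocab) + 1)
                word_stream.append(index)
                doc_stream.append(doc_id)
            word_stream.append(0)
    return word_stream, doc_stream, vocab


if __name__ == "__main__":
    docs = [
        [["a", "b"], ["c", "a", "b"]],  # 5 words
        [["b", "d", "e"]],              # 3 words
        [["a", "a", "e", "f"]],         # 4 words
    ]
    ws, ds, _ = build_streams(docs)
    print(" ".join(map(str, ws)))
    print(" ".join(map(str, ds)))
```

Each stream would then be written to its file as a single space-separated line.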

The included script "prepare_data.py" will take a collection of plain text documents in a directory as input and create the required word stream and document stream files as described above ("WS" and "DS") and the required vocabulary file "WO". The script can be run as follows:

python prepare_data.py DATA

where DATA is a directory containing the plain text files.
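To sanity-check a generated word stream, it can be split back into sentences at the terminator index 0. The helper below is a hypothetical sketch based only on the format described above; it is not part of the package.

```python
def split_sentences(word_stream_text):
    """Split a single-line word stream into sentences of vocabulary indices."""
    sentences, current = [], []
    for token in word_stream_text.split():
        index = int(token)
        if index == 0:
            # Index 0 marks the end of a sentence.
            sentences.append(current)
            current = []
        else:
            current.append(index)
    if current:
        # Tolerate a stream whose last sentence has no terminator.
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    print(split_sentences("1 2 0 3 1 2 0"))
```

With a real file, the text would be read first, for example with open("WS").read(), assuming the default filename.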

To run the Gibbs Sampler on this dataset, type "./syntaxsum WS DS", where WS is the name of the word stream file and DS is the name of the document stream file (these are the default filenames created by the prepare_data.py script). Three files will be created. The first, "zeta", is an LDA-style matrix that allows simple analysis of the learned topics through the Python script topics.py. The other two files, WP.txt and MP.txt, contain the probabilities of words for each topic and each syntax class respectively, and are used with the summarization scripts syntax_sumA.py and syntax_sumB.py.

TO CREATE THE MULTI-DOCUMENT SUMMARIES
--------------------------------------

Once the distributions have been learned, the summarization scripts can be used to create summaries of the input documents. The scripts are set up for the DUC 2006 dataset, but this can be changed through the constants at the top of each script. By default, the input documents should be in a directory called "DUC", with the files named sequentially from 0 to 1249. The summaries are placed by default in a directory called "OUT" and are numbered from 0 to 49, one for each of the 50 document sets.
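The default layout can be pictured with the short sketch below. The constant and function names are hypothetical, and the sketch assumes that consecutive file numbers belong to the same document set (files 0 to 24 form set 0, files 25 to 49 form set 1, and so on), which the description above suggests but does not state outright.

```python
TOTAL_FILES = 1250   # DUC 2006 default: 50 document sets of 25 files
FILES_PER_SET = 25


def document_set_of(file_index):
    """Return the document set number for an input file (assumed contiguous)."""
    return file_index // FILES_PER_SET


def files_in_set(set_index):
    """Return the input file numbers that make up one document set."""
    start = set_index * FILES_PER_SET
    return list(range(start, start + FILES_PER_SET))


if __name__ == "__main__":
    print(document_set_of(1249))   # last file
    print(files_in_set(0)[:3])     # first files of the first set
```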

To run the summarizer with the original SumBasic redundancy removal and the learned topic distributions (the "SBH" summarizer in the paper), type:

python syntax_sumA.py DIR

where DIR is the directory containing the file WP.txt. The SumBasic algorithm can be run by simply typing "python sumbasic.py", as it does not require any previously learned distributions.
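For readers unfamiliar with SumBasic, the following is a minimal sketch of the algorithm as it is commonly described: sentences are scored by the average probability of their words, the best sentence containing the currently most probable word is selected, and the probabilities of the selected words are then squared to discourage redundancy. It is not the code in sumbasic.py. The function name and the pre-tokenized input are hypothetical, and the length handling is a simplification (selection stops once the word budget is reached, with no truncation).

```python
from collections import Counter


def sumbasic(sentences, max_words):
    """Select sentence indices for a summary.

    sentences: list of sentences, each a list of word strings.
    max_words: stop selecting once the summary has at least this many words.
    """
    counts = Counter(word for sentence in sentences for word in sentence)
    total = sum(counts.values())
    prob = {word: count / total for word, count in counts.items()}

    remaining = [i for i, sentence in enumerate(sentences) if sentence]
    chosen, length = [], 0
    while remaining and length < max_words:
        # Prefer sentences that contain the currently most probable word.
        top_word = max(prob, key=prob.get)
        candidates = [i for i in remaining if top_word in sentences[i]] or remaining
        # Score each candidate by the average probability of its words.
        best = max(
            candidates,
            key=lambda i: sum(prob[w] for w in sentences[i]) / len(sentences[i]),
        )
        chosen.append(best)
        length += len(sentences[best])
        remaining.remove(best)
        # Redundancy removal: square the probability of every selected word.
        for word in set(sentences[best]):
            prob[word] *= prob[word]
    return chosen


if __name__ == "__main__":
    example = [["the", "cat", "sat"], ["the", "cat", "ran"], ["dogs", "bark"]]
    print(sumbasic(example, max_words=5))
```

The SBH summarizer is described above as combining this redundancy removal with the learned topic distributions; how those distributions enter the scoring is not shown in this sketch.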

TO RUN EXPERIMENTS
------------------

While the included software is set up by default to work with the DUC 2006 dataset, we do not have permission to distribute this data. Nevertheless, anyone can access the data provided they register with NIST. The data and requisite forms are available at http://duc.nist.gov/data.html.

Further, to perform other multi-document summarization, the settings at the top of the Python scripts (and the #defines in main.cpp) must simply be changed to fit the plain-text dataset that is to be summarized. The location of the data must be set, along with the total number of files, the number of files per document set, and the maximum length of the summaries.

To determine the ROUGE scores for the generated summaries, the ROUGE toolkit can be downloaded at http://berouge.com. Like the DUC data, we cannot distribute the ROUGE software with this package.

APPENDIX
--------

Here is an example of the output from topics.py (the top 10 words for each document set topic) for the DUC 2006 dataset after running the Gibbs Sampler for 1000 iterations. This output can be produced by running the topics.py script as follows:

python topics.py zeta WO 10

where "zeta" is the topics distribution file created by the Gibbs sampler, "WO" is the vocabulary file created by prepare_data.py, and the top 10 words for each document set topic will be displayed.
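The selection of the top words for one topic can be sketched as follows. The on-disk layout of "zeta" and "WO" is not described in this file, so the sketch works on in-memory lists and makes no claim about how topics.py reads its inputs; the function name top_words and its arguments are hypothetical.

```python
import heapq


def top_words(weights, vocabulary, n=10):
    """Return the n words with the largest weights for one topic.

    weights:    one number per vocabulary entry (e.g. counts or probabilities).
    vocabulary: list of words aligned with weights.
    """
    ranked = heapq.nlargest(n, zip(weights, vocabulary))
    return [word for _, word in ranked]


if __name__ == "__main__":
    # Toy values chosen only to show the call; they are not learned output.
    print(top_words([1, 5, 4], ["w1", "w2", "w3"], n=2))
```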

Document Set 0
---------------
indian
reservation
tribe
tribes
tribal
indians
gambling
state
casino
american

Document Set 1
---------------
steroids
athletes
drug
positive
steroid
women
anabolic
her
doping
use

Document Set 2
---------------
wetlands
corps
water
protection
environmental
acres
new
city
permit
state

Document Set 3
---------------
star
wars
movie
film
phantom
menace
fans
lucas
first
million

Document Set 4
---------------
arthritis
pain
drug
patients
drugs
celebrex
osteoarthritis
disease
vioxx
merck

Document Set 5
---------------
climate
global
warming
change
ice
scientists
report
sea
world
greenhouse

Document Set 6
---------------
china
government
chinese
beijing
police
state
rights
people
unrest
workers

Document Set 7
---------------
safety
seat
air
child
seats
vehicle
bags
new
children
car

Document Set 8
---------------
settlements
israel
west
bank
settlement
israeli
peace
jewish
palestinian
jerusalem

Document Set 9
---------------
home
school
children
schooling
schools
parents
education
public
her
students

Document Set 10
---------------
organic
york
oct
new
garden
farming
soil
photo
plants
farm

Document Set 11
---------------
autism
children
autistic
brain
parents
study
disorder
secretin
child
researchers

Document Set 12
---------------
generation
x
young
and
wine
gen
xers
age
people
boomers

Document Set 13
---------------
quebec
canada
bouchard
referendum
government
province
french
charest
quebecers
independence

Document Set 14
---------------
evolution
science
board
standards
state
school
kansas
theory
teachers
education

Document Set 15
---------------
chechnya
russian
chechen
russia
federal
terrorist
moscow
troops
republic
attacks

Document Set 16
---------------
plane
flight
crash
egyptair
investigators
board
pilot
hall
recorder
data

Document Set 17
---------------
malaria
health
africa
disease
prevention
control
world
african
vaccine
year

Document Set 18
---------------
gay
republicans
bush
republican
party
rights
gop
gays
log
cabin

Document Set 19
---------------
school
students
schools
violence
high
security
columbine
year
gun
police

Document Set 20
---------------
hong
kong
police
china
crime
chinese
gang
cheung
criminal
macau

Document Set 21
---------------
virus
west
nile
birds
mosquitoes
york
infected
new
health
encephalitis

Document Set 22
---------------
smoking
public
tobacco
places
anti
health
smokers
ban
smoke
law

Document Set 23
---------------
police
lawrence
london
black
stephen
report
racism
britain
inquiry
white

Document Set 24
---------------
kenya
aids
diseases
health
nairobi
malaria
government
african
kenyan
disease

Document Set 25
---------------
embassy
tanzania
bombing
kenya
u
nairobi
es
salaam
dar
bomb

Document Set 26
---------------
children
adoption
parents
adoptions
adopted
families
child
international
agencies
treaty

Document Set 27
---------------
adhd
children
ritalin
disorder
drug
school
attention
child
deficit
study

Document Set 28
---------------
virus
computer
melissa
mail
it
e
computers
viruses
software
smith

Document Set 29
---------------
book
booksellers
books
amp
independent
barnes
noble
com
sales
online

Document Set 30
---------------
concorde
france
air
crash
plane
french
british
paris
flight
tire

Document Set 31
---------------
mongolia
china
relations
mongolian
economic
two
cooperation
visit
minister
bilateral

Document Set 32
---------------
crime
drug
prison
rate
crack
police
percent
state
offenders
crimes

Document Set 33
---------------
salmon
fish
dams
river
fisheries
pacific
species
dam
endangered
snake

Document Set 34
---------------
bush
death
texas
penalty
governor
case
execution
state
board
executed

Document Set 35
---------------
gm
uaw
workers
union
parts
ford
strike
plants
delphi
contract

Document Set 36
---------------
solar
energy
electricity
power
china
areas
panels
water
use
world

Document Set 37
---------------
jupiter
galileo
europa
moon
io
spacecraft
ocean
surface
earth
scientists

Document Set 38
---------------
chemical
plant
weapons
u
vx
united
officials
sudan
factory
iraq

Document Set 39
---------------
submarine
kursk
russian
navy
officials
sea
nuclear
crew
rescue
russia

Document Set 40
---------------
warming
global
climate
scientists
earth
century
temperature
change
ice
atmospheric

Document Set 41
---------------
chavez
venezuela
president
oil
caracas
government
venezuelan
hugo
cuba
political

Document Set 42
---------------
el
nino
weather
pacific
la
nina
ocean
normal
temperatures
warming

Document Set 43
---------------
tax
budget
surplus
clinton
greenspan
federal
billion
security
economy
social

Document Set 44
---------------
housing
income
low
families
federal
cuomo
vouchers
poor
affordable
units

Document Set 45
---------------
perjury
clinton
case
lewinsky
sexual
president
he
judge
jury
federal

Document Set 46
---------------
elian
miami
boy
news
father
cuba
service
relatives
gonzalez
court

Document Set 47
---------------
disorder
drug
mental
anxiety
children
prozac
disorders
compulsive
brain
obsessive

Document Set 48
---------------
putin
russia
president
election
yeltsin
russian
moscow
his
state
government

Document Set 49
---------------
carter
center
president
elections
election
former
jimmy
government
country
nigeria
