This package contains a Java implementation of APS (Affinity Propagation for Segmentation), a text segmentation algorithm. It 
is available under the GNU General Public License; please see the enclosed file LICENSE.txt for details. 

While this implementation is far from fully optimized, it is parallelized to make use of all available cores, so the CPUs will be fully loaded while it runs.

CONTACT INFORMATION:
To be available

CONTENTS OF THE PACKAGE: 
 
./bin : compiled class files
./build/jar : output directory for ANT
	./build/jar/APS.jar - a version of the segmenter that can be run from the command line. 
	./build/jar/lingpipe-3.9.1.jar - for convenience, the external dependency library is duplicated here. If it is not in the same directory as APS.jar, you must put it on the Java classpath. 
		(Note that java -jar ignores -cp, so in that case invoke the main class directly, for example: java -cp APS.jar:some_directory/lingpipe-3.9.1.jar segmenter.RunSegmenter -config config_file_path)
	
./data : three datasets used for the experiments described in the paper. 'dev' files were used to select parameters. 'ref' files were used for testing.
	./data/ai_manual : manually transcribed and segmented lectures from a course on Artificial Intelligence. The dataset was compiled and made available by Malioutov and Barzilay (2006).
	./data/clinical : chapters of medical textbooks segmented into sections. Compiled and made available by Eisenstein and Barzilay (2008).
	./data/fiction : novels and short stories downloaded from Project Gutenberg (http://www.gutenberg.org/wiki/Main_Page). The collection was downloaded and compiled using a machine-readable catalogue. Segment breaks correspond to chapter breaks or breaks between stories, as marked up in the HTML sources.
	
./config : contains the config files used to obtain the results reported in the paper (ai.config, clinical.config and fiction.config respectively). It also contains an example config file that can be 
	used to select the best parameters on a dataset (example_parameter_tuning.config). Note: for now, the search through the parameter space is exhaustive, so unless you have a very fast 
	machine, it is best to be careful and limit the search space. 
	
./lib : contains dependencies necessary to compile the segmenter:
	./lib/lingpipe-3.9.1.jar : the wonderfully useful LingPipe API (http://alias-i.com/lingpipe/). It must be on the Java classpath both to run APS.jar and to compile the source code.
./runAPS.sh : an example shell script invoking APS using the APS.jar file. It includes an example of using a config file to specify the options and an example of setting them on the command line.
./src : the source code for the APS segmenter. API documentation is forthcoming.
./STOPWORD.list : list of stop words used. From (Malioutov and Barzilay 2006)

INSTALLATION:
Normally you should be able to run the APS.jar file (found in ./build/jar) as is:

	java -jar -Xms1000m -Xmx6500m build/jar/APS.jar -config $APS_DIR/config/ai.config

If you need to rebuild, run ANT from the top directory (ap_segmentation):
	ant clean
	ant 
This will rebuild the jar file and copy the necessary lingpipe-3.9.1.jar to the ./build/jar directory. 


USAGE:

The segmenter can be invoked in one of two modes: run or tune parameters (-run or -tune, respectively). All options may be passed either on the command line, or in a config file given as the 
argument to the -config option. For example, to use the file example.config to specify all options:

	java -cp .:lib/lingpipe-3.9.1.jar segmenter.RunSegmenter -config example.config

example.config may look something like this:
----------------
-tune 
-tuneWinRatios 0.4,0.5,0.6
-tunePrefs -0.5,-0.6,-0.7
-tuneDamps 0.8,0.9
-inputDir /Users/anna/Documents/workspace/segmentation/data/ai_manual
-outputDir /Users/anna/Documents/workspace/segmentation/Parameters_Output/ai_manual
-inputExtensions dev
-corpusExtensions dev,ref
-resultFile ai_test_command_line_tune.txt
-sparse true
-useSegmentDf 
-numTFIDFsegments 15
---------------

Alternatively, we could pass all arguments on the command line:

java -cp .:lib/lingpipe-3.9.1.jar:lib/commandln.jar segmenter.RunSegmenter -tune -tuneWinRatios 0.4,0.5,0.6 -tunePrefs -0.5,-0.6,-0.7 -tuneDamps 0.8,0.9 -inputDir /Users/anna/Documents/workspace/segmentation/data/ai_manual -outputDir /Users/anna/Documents/workspace/segmentation/Parameters_Output/ai_manual -inputExtensions dev -corpusExtensions dev,ref -resultFile ai_test_command_line_tune.txt -sparse true -useSegmentDf -numTFIDFsegments 15


OPTIONS:

1) Mode of running: -run (run with specific preference, damping factor and windowSize values) or -tune (select the best preference,
	damping factor and windowSize values from pre-specified arrays of candidate choices). Neither -run nor -tune takes a value.

	1.1) If you set -run option, then the following options must also be set:
	
		-preference <positive or negative double value>
		-damping <positive double value>
		-windowRatio <positive double value that is <=1 > OR windowSize <positive int value>
		
		The preference value corresponds to the negative cost of adding each sentence as a segment center. Its value will depend on the similarity metric used. For the experiments described in the paper
		we found that it only makes sense to try negative preference values; however, that is only because our similarities range from 0 to 1. Searching between the median and minimum similarity values
		may be a good starting point for other ranges. Generally, higher preference values result in many small, fine-grained segments, while lower preference values result in fewer, coarser-grained segments.
		
		In this release, we set all preferences to the same value. However, setting them more intelligently will almost certainly produce better results (e.g., giving higher preference to longer sentences,
		or to the first sentence in each paragraph).
		
		The damping factor (-damping option) sets the damping factor lambda. It can be any double between 0.5 and 0.999999. A lower lambda results in faster convergence but may lead to oscillations and worse results;
		a higher lambda results in slower convergence.
		
		windowRatio or windowSize specifies the size of the sliding window for computing similarities between sentences. Sentences that are further apart than the size of the window are considered maximally 
		dissimilar (-INF) and effectively cannot belong in one segment. Because of this, the size of the window should be at least twice the average segment length, or better yet twice the expected
		maximum segment length. windowSize specifies the window as a number of sentences; windowRatio specifies it as a fraction of the document length (e.g., 0.5 or 0.8) and must be <= 1.
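
		For instance, a config file for the -run mode might look like the one below, analogous to the -tune example above. The preference, damping and windowRatio values are illustrative picks from the ranges discussed here, not the tuned settings reported in the paper, and the result file name is made up:
		----------------
		-run
		-preference -0.6
		-damping 0.9
		-windowRatio 0.5
		-inputDir /Users/anna/Documents/workspace/segmentation/data/ai_manual
		-outputDir /Users/anna/Documents/workspace/segmentation/Parameters_Output/ai_manual
		-inputExtensions ref
		-corpusExtensions dev,ref
		-resultFile ai_example_run.txt
		-sparse true
		----------------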
		
	1.2) If you set the -tune option, then you must also specify the preference, windowSize or windowRatio, and damping factor values to search through. These correspond to the following options:
		-tuneWinRatios <comma separated positive double values> OR -tuneWinSizes <comma separated positive int values>
		-tunePrefs <comma separated double values, most likely negative>
		-tuneDamps <comma separated positive double values>
		
	1.3) Regardless of the run mode, the following options apply:
	
		-sparse <true | false> : use the sparse version of the segmenter. By default, it should always be set to true. The only reason NOT to set it is if you want to compute the full 
		similarity matrix with no windowSize cut-off; however, for most documents over 400-500 sentences long this will take too long. If the option is set to false, then windowSize (or windowRatio)
		as well as tuneWinSizes (or tuneWinRatios) will have no effect.
		
		-smoothing <no value> : whether to smooth counts between adjacent sentence vectors as described in (Malioutov and Barzilay 2006). If this option is set, then you must also specify:
			-smoothingAlpha <double value greater than 0>
			-smoothingWindow <int value>
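
			For example, enabling smoothing in a config file might look like this (the alpha and window values here are placeholders for illustration, not recommended settings):
				-smoothing
				-smoothingAlpha 0.1
				-smoothingWindow 2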
			
		-useSegmentDf <no value> : if this option is NOT set, then sentence vectors are weighted by per-document tf.idf scores. However, for some documents it is more appropriate to compute tf.idf differently: 
		by splitting a document into N segments and using term frequencies within each segment instead of per-document frequencies. Setting -useSegmentDf chooses the latter option. If you do so, you 
		must also specify
			
			-numTFIDFsegments <int value> : the number of segments to use for computing segment-based frequencies
			
		-inputExtensions <comma separated file extensions> : list of file extensions to process. 
		
		-corpusExtensions <comma separated file extensions> : list of file extensions to use for computing conventional tf.idf values.
	

	