experiments201602: Code for the English-Spanish word translation experiments
=======================

# Requirements
## System Requirements
The memory bottleneck is the parallel processing in `experiment_cleigenwords_taskeval.Rmd`.
The original version of `experiment_cleigenwords_taskeval.Rmd` uses 70GB of RAM.

If that is more memory than you have, reduce the number of worker processes in `cl <- makeCluster(24)`.
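
One way to change the worker count without opening the file is a `sed` substitution. The sketch below assumes the call appears literally as `makeCluster(24)` in the Rmd file. The helper name `set_cluster_cores` is illustrative and not part of this repository.

```sh
# Hypothetical helper (not part of this repository): rewrite the literal
# `makeCluster(24)` call in a file so that it uses the given worker count.
set_cluster_cores() {
  file=$1
  cores=$2
  # Reject anything that is not a positive integer.
  case $cores in
    ''|*[!0-9]*|0) echo "cores must be a positive integer" >&2; return 1 ;;
  esac
  # Write to a temporary file first so the original survives a failed sed.
  tmp=$(mktemp) || return 1
  sed "s/makeCluster(24)/makeCluster($cores)/" "$file" > "$tmp" && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return $status
}
```

For example, `set_cluster_cores experiment_cleigenwords_taskeval.Rmd 8` would leave `cl <- makeCluster(8)` in the file. How much memory a given worker count saves has not been measured here.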

## Prerequisites
* pandoc (>=1.12.3)
* R (>=3.2.3)

and the following R packages:

```r
packages <- c("Matrix", "svd", "Rcpp", "RcppEigen", "foreach", "doParallel", "knitr", "RecordLinkage", "viridis")
for (package in packages) {
  if (!require(package, character.only = TRUE)) {
    install.packages(package)
  }
}
```


# How to Reproduce Our Results

* Install dependencies
* Execute `main_experiment.sh`
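
The two steps above could be guarded by a version check along these lines. This is a sketch that assumes `pandoc` and `R` are on the `PATH` and that `sort -V` (a GNU coreutils extension) is available. The helper names `version_ge` and `check_prereqs` are illustrative.

```sh
# Succeeds if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort), a GNU coreutils extension.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Checks the minimum versions of pandoc and R required by this repository.
# The parsing of the `--version` output is an assumption about its usual
# layout ("pandoc X.Y.Z" and "R version X.Y.Z ...").
check_prereqs() {
  pandoc_ver=$(pandoc --version | head -n 1 | awk '{print $2}')
  r_ver=$(R --version | head -n 1 | awk '{print $3}')
  version_ge "$pandoc_ver" 1.12.3 || { echo "pandoc >= 1.12.3 required" >&2; return 1; }
  version_ge "$r_ver" 3.2.3 || { echo "R >= 3.2.3 required" >&2; return 1; }
}
```

With these defined, `check_prereqs && ./main_experiment.sh` (assuming the script is executable) would start the experiment only when both version checks pass. The R packages are not checked by this sketch.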

## Output

The following files are generated by `main_experiment.sh`.

|File path                                                |Description                                      |
|:--------------------------------------------------------|:------------------------------------------------|
|res_cleigenwords.Rdata                                   |Output data of CL-Eigenwords                     |
|run_cleigenwords.html                                    |Rendered Rmarkdown of CL-Eigenwords              |
|experiment\_cleigenwords\_taskeval\_es-en\_from\_es.html |Rendered Rmarkdown of translation task (es -> en)|
|experiment\_cleigenwords\_taskeval\_es-en\_from\_en.html |Rendered Rmarkdown of translation task (en -> es)|


# Contents

* main_experiment.sh
    * Shell script that runs the whole experiment
* kadingir/
    * Package containing implementations of Eigenwords, CL-LSI, and CL-Eigenwords
* bilbowa/
    * Cloned from <https://github.com/gouwsmeister/bilbowa>, and then modified.
* run_bilbowa.sh
* test_bilbowa.Rmd


# Preparation of Correct Translation Pairs (Not Required)
## Output vocabularies

```r
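# Load the object `r` stored in the CL-Eigenwords output file.
load("res_cleigenwords.Rdata")
# Write one word per line; judging by the file names, [[1]] is the Spanish vocabulary and [[2]] the English one.
write(paste0(r$vocab.words[[1]], collapse = "\n"), "output_vocab_es-en_es.csv")
write(paste0(r$vocab.words[[2]], collapse = "\n"), "output_vocab_es-en_en.csv")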
```

## Make translation pairs using Google Translate
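Use the `GOOGLETRANSLATE` function of Google Sheets on each vocabulary file, for example `=GOOGLETRANSLATE(A1, "es", "en")` for a Spanish word in cell `A1`.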

## Preprocess the output of Google Translate

```sh
python3 preprocess_google_translate.py google_translate_es-en_es{,_preprocessed}.csv
python3 preprocess_google_translate.py google_translate_es-en_en{,_preprocessed}.csv
```
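
The `{,_preprocessed}` suffix in the commands above is shell brace expansion: it turns one word into two file names, presumably the input file and the output file of the script. The snippet below only prints the expanded arguments so they can be checked before running the script. It assumes a shell that supports brace expansion, such as bash or zsh; plain POSIX `sh` does not.

```sh
# Print each expanded argument on its own line; nothing is read or written.
printf '%s\n' google_translate_es-en_es{,_preprocessed}.csv
# google_translate_es-en_es.csv
# google_translate_es-en_es_preprocessed.csv
```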