-------------------------------- INTRODUCTION --------------------------------
This submission contains data and code for preliminary experiments demonstrating creative problem solving in large language and vision models (LLVMs), inspired by the Computational Creativity literature. The code in this repository evaluates the ability of LLVMs to identify creative object replacements when the required objects are missing, e.g., substituting a bowl for a missing scoop. The approach evaluates the performance of the LLVMs under different prompts, e.g., prompts augmented with relevant object features.



-------------------------------- INSTRUCTIONS FOR RUNNING THE CODE --------------------------------
This code requires the [PyTorch](https://github.com/pytorch/pytorch) and HuggingFace [Transformers](https://github.com/huggingface/transformers) libraries. To install the necessary packages, run: `pip install -r requirements.txt`

Running with the default seed setting (`seed=42`) will reproduce the results in Figures 2-5 from the paper. To run the code, use: `python eval_task.py --task-type creative-obj`

Details of the models and task prompts are available in `dataset_cfg.py`. The supported task types include: 
a) `creative` that uses regular prompts in cases where an object replacement is required (Figure 2); 
b) `creative-obj` that adds object feature information (affordance) to the prompt (Figure 3);
c) `creative-task` that adds task information to the prompt (Figure 4);
d) `creative-task-obj` that combines affordance and task information (Figure 5); 
e) `nominal` that uses regular prompts, tested on cases where an object replacement is not required (Table 1). 
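The task types above differ only in the auxiliary information added to the prompt. The following is a hypothetical sketch of that assembly; the actual templates live in `dataset_cfg.py`, and the function name, field wording, and example strings here are illustrative assumptions, not the repository's code:

```python
def build_prompt(task_type, base_query, affordance=None, task_desc=None):
    """Assemble a prompt for one of the supported task types.

    Hypothetical sketch: appends affordance and/or task information
    to a base query, mirroring the creative / creative-obj /
    creative-task / creative-task-obj distinction described above.
    """
    parts = [base_query]
    if task_type in ("creative-obj", "creative-task-obj") and affordance:
        parts.append(f"Object features: {affordance}")
    if task_type in ("creative-task", "creative-task-obj") and task_desc:
        parts.append(f"Task: {task_desc}")
    return "\n".join(parts)

# Example: the combined variant carries both kinds of information.
prompt = build_prompt(
    "creative-task-obj",
    "Which available object could replace the missing scoop?",
    affordance="the bowl is concave and can contain material",
    task_desc="scoop flour from a bag",
)
```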

Cases a) to d) require an object replacement, whereas in case e) the desired object is available (used as a baseline). The code runs the evaluation (creating random test sets based on the seed) and reports the results via the plots shown in the paper. The full test dataset consists of 16 RGB images of objects, from which subsets are randomly chosen; see `artificial-dataset`.
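The seeded subset selection can be sketched as follows. This is a minimal illustration of why a fixed seed makes the test sets (and hence the figures) reproducible; the function name and file paths are hypothetical, not the repository's actual code:

```python
import random

def sample_test_set(image_paths, subset_size, seed=42):
    """Deterministically sample a test subset from the image pool.

    A local RNG seeded with `seed` is used, so the same seed always
    yields the same subset without touching global random state.
    """
    rng = random.Random(seed)
    return rng.sample(image_paths, subset_size)

# Hypothetical 16-image pool; repeated calls with the same seed
# return identical subsets.
pool = [f"artificial-dataset/obj_{i:02d}.png" for i in range(16)]
subset = sample_test_set(pool, 8, seed=42)
```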

To run with a different seed, use: `python eval_task.py --task-type <type> --seed <seed_num>`



-------------------------------- REPRODUCIBILITY AND COMPUTE RESOURCES NOTE --------------------------------
This code was tested with `Python 3.10.12`. While the code can be executed on a CPU, the reported results were obtained on a single NVIDIA A100 GPU. The versions of all other packages used are noted in `requirements.txt`. We installed PyTorch 1.13.0 for CUDA 11.6; for more details on installation, please see the PyTorch [Installation Instructions](https://pytorch.org/get-started/locally/). The seeds used in the experiments include `42` (Figures 2-5 in the paper); for the overall average results in Figure 6, we additionally used seeds `18`, `343`, `496`, `471`, `752`, `971`, `206`, `122`, and `947`. The plot in Figure 6 was generated separately by accumulating the results from each run into a spreadsheet.
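The spreadsheet step for Figure 6 amounts to averaging one result per seed. A minimal sketch of that aggregation, using the seeds listed above but placeholder metric values (NOT the paper's numbers):

```python
from statistics import mean, stdev

# Seeds used for the Figure 6 averages, as listed above.
SEEDS = [42, 18, 343, 496, 471, 752, 971, 206, 122, 947]

def aggregate(results_per_seed):
    """Collapse one metric value per seed into a mean and spread.

    `results_per_seed` maps seed -> metric value for that run; the
    values below are dummies, not results from the paper.
    """
    values = [results_per_seed[s] for s in SEEDS]
    return mean(values), stdev(values)

# Illustration with dummy per-seed accuracies:
dummy = {s: 0.5 for s in SEEDS}
avg, sd = aggregate(dummy)
```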
