# INLG 2022: Generation of Student Questions for Inquiry-based Learning


## Data used in paper
The data is at ``data/text_pairs.pkl`` and consists of lecture window-question pairs.
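
The pickle can be inspected before opening the notebook. The snippet below is a minimal sketch that assumes the file loads with the standard `pickle` module; the helper names `load_pairs` and `describe` are illustrative and not part of this repository. The container layout is not documented here, so the sketch only reports the type and size. If the object was pickled from a third-party library such as pandas, that library must be installed for loading to succeed.

```python
import pickle
from pathlib import Path


def load_pairs(path="data/text_pairs.pkl"):
    """Load the pickled lecture window-question pairs (hypothetical helper)."""
    with Path(path).open("rb") as handle:
        # Only unpickle files you trust: pickle can execute arbitrary code.
        return pickle.load(handle)


def describe(obj):
    """Return the container type name and its length, if it has one."""
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size
```

For example, `describe(load_pairs())` shows what kind of object the file holds before any field names are assumed.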

## Code used in paper

``transformers_mod/src/prefix.ipynb`` is the main notebook used for experiments. Some values are hard-coded (depending on docTTTTTquery, t5-base, and hyperparameters), so please refer to the notebook for how to reproduce the results.

## Run results

Run results are in ``transformers_mod/src/runs``.

## Raw Data
- ``data/mooc_data`` contains the raw MOOC transcripts in both srt and txt formats. We use the srt format for the timestamps.
- ``data/questions`` contains the raw questions. Most are formatted as `?L[LECTURE NUMBER]: [START TIME]-[END TIME]: [QUESTION]`. The files also contain quizzes, which are ignored in this project. The lecture numbers need to be mapped to the transcript names, as described under "Lecture number - transcript name mapping" below.
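
The question format above can be matched with a regular expression. The following is a sketch under stated assumptions, not the parser used for the paper: it assumes lecture numbers look like `3.2` and that times consist only of digits, colons, and dots. `QUESTION_RE` and `parse_question_line` are hypothetical names, and the sample line in the usage note is made up.

```python
import re

# Assumed shape: ?L<lecture>: <start>-<end>: <question>
QUESTION_RE = re.compile(
    r"^\?L(?P<lecture>\d+(?:\.\d+)?):\s*"
    r"(?P<start>[\d:.]+)\s*-\s*(?P<end>[\d:.]+):\s*"
    r"(?P<text>.+)$"
)


def parse_question_line(line):
    """Return the fields of one question line, or None if it does not match."""
    match = QUESTION_RE.match(line.strip())
    if match is None:
        return None  # e.g. quiz lines or free-form text
    return match.groupdict()
```

Calling `parse_question_line("?L3.2: 1:30-2:15: Why is TF normalized?")` yields a dictionary with the keys `lecture`, `start`, `end`, and `text`; lines that do not follow the format return `None`.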

## Data preprocessing

### Questions
- The parsed questions are saved in ``data/parsed/questions``, then cleaned manually and saved in ``data/questions_cleaned``. They are stored as `.csv` files with the following fieldnames:
    - **name(str)**: question file name
    - **lecture(str)**: lecture name
    - **time(str)**: timespan containing both start and end time
    - **text(str)**: question text
- For lines that do not match the question format described under Raw Data, the file name and line number are saved to `./data/parsed/questions/messy_data.csv`, which contains the following fieldnames:
    - **name(str)**: name of the file the messy data comes from
    - **line(int)**: line number of the messy data
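
The split between parsed rows and messy rows can be sketched as below. This is an illustration only: `split_rows` and `to_csv` are hypothetical helpers, `parse` stands for any function that returns a dictionary with `lecture`, `time`, and `text` for a well-formed line and `None` otherwise, and the choices to count lines from 1 and to skip blank lines are assumptions.

```python
import csv
import io

QUESTION_FIELDS = ["name", "lecture", "time", "text"]
MESSY_FIELDS = ["name", "line"]


def split_rows(name, lines, parse):
    """Route each line to the parsed rows or to the messy-data rows."""
    parsed, messy = [], []
    for line_number, line in enumerate(lines, start=1):  # assumed 1-indexed
        if not line.strip():
            continue  # assumption: blank lines are neither parsed nor messy
        fields = parse(line)
        if fields is None:
            messy.append({"name": name, "line": line_number})
        else:
            parsed.append({"name": name, **fields})
    return parsed, messy


def to_csv(rows, fieldnames):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Keeping the line number of every rejected line makes the manual cleaning step easier, since each messy entry points back to an exact position in the raw file.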

### Transcripts
- The transcripts and related information for each lecture are saved in `<lecture-name>.csv`. For example, `1 - 1 - Course Welcome (00-03-11).srt` will be processed into `1 - 1 - Course Welcome (00-03-11).csv`. These files are saved in ``data/parsed/transcripts``. 
- Each `.csv` file contains the following fieldnames:
    - **name(str)**: transcript file name
    - **id(int)**: transcript id
    - **from(float)**: start time of the transcript
    - **to(float)**: end time of the transcript
    - **text(str)**: transcript text
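
As an illustration of how an `.srt` file maps onto these fields, the sketch below parses standard SubRip cues (index line, `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, text lines). It is not the preprocessing script from this repository: `parse_srt` and `to_seconds` are hypothetical names, and treating `from` and `to` as seconds is an assumption based on the `float` type.

```python
import re

# Standard SubRip timing line, e.g. 00:00:01,000 --> 00:00:04,500
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def to_seconds(hours, minutes, seconds, millis):
    """Convert timestamp parts to float seconds (assumed unit)."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000.0


def parse_srt(srt_text, name):
    """Turn SubRip text into rows with the fields name, id, from, to, text."""
    rows = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 2 or not lines[0].isdigit():
            continue  # skip malformed cues
        match = TIMING.search(lines[1])
        if match is None:
            continue
        parts = match.groups()
        rows.append({
            "name": name,
            "id": int(lines[0]),
            "from": to_seconds(*parts[:4]),
            "to": to_seconds(*parts[4:]),
            "text": " ".join(lines[2:]),  # join multi-line cue text
        })
    return rows
```

The resulting rows can be written with `csv.DictWriter` using the fieldnames listed above.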

### Lecture number - transcript name mapping

Each entry below maps a parsed transcript file name to the lecture number used in the question files, written as `[WEEK].[INDEX]`.


#### Week 1
- 2 - 1 - 1.1 Natural Language Content Analysis (00-21-05).csv --> 1.1 
- 2 - 2 - 1.2 Text Access (00-09-24).csv --> 1.2
- 2 - 3 - 1.3 Text Retrieval Problem (00-26-18).csv --> 1.3
- 2 - 4 - 1.4 Overview of Text Retrieval Methods (00-10-10).csv --> 1.4
- 2 - 5 - 1.5 Vector Space Model- Basic Idea (00-09-44).csv --> 1.5
- 2 - 6 - 1.6 Vector Space Model- Simplest Instantiation (00-17-30).csv --> 1.6

#### Week 2
- 2 - 7 - 1.7 Vector Space Model- Improved Instantiation (00-16-52).csv --> 2.1
- 2 - 8 - 1.8 TF Transformation (00-09-31).csv --> 2.2
- 2 - 9 - 1.9 Doc Length Normalization (00-18-56).csv --> 2.3
- 3 - 1 - 2.1 Implementation of TR Systems (00-21-27).csv --> 2.4
- 3 - 2 - 2.2 System Implementation- Inverted Index Construction (00-18-21).csv --> 2.5
- 3 - 3 - 2.3 System Implementation- Fast Search (00-17-11).csv --> 2.6

#### Week 3
- 3 - 4 - 2.4 Evaluation of TR Systems (00-10-10).csv --> 3.1
- 3 - 5 - 2.5 Evaluation of TR Systems- Basic Measures (00-12-54).csv --> 3.2
- 3 - 6 - 2.6 Evaluation of TR Systems- Evaluating Ranked Lists Part 1 (00-12-51).csv --> 3.3
- 3 - 7 - 2.6 Evaluation of TR Systems- Evaluating Ranked Lists Part 2 (00-10-01).csv --> 3.4
- 3 - 8 - 2.7 Evaluation of TR Systems- Multi-Level Judgements (00-10-48).csv --> 3.5
- 3 - 9 - 2.8 Evaluation of TR Systems- Practical Issues (00-15-14).csv --> 3.6

#### Week 4
- 4 - 1 - 3.1 Probabilistic Retrieval Model- Basic Idea (00-12-44).csv --> 4.1
- 4 - 2 - 3.2 Statistical Language Models (00-17-53).csv --> 4.2
- 4 - 3 - 3.3 Query Likelihood Retrieval Function (00-12-07).csv --> 4.3
- 4 - 4 - 3.4 Smoothing of Language Model - Part 1 (00-12-15).csv --> 4.4
- 4 - 5 - 3.4 Smoothing of Language Model - Part 2 (00-09-36).csv --> 4.5
- 4 - 6 - 3.5 Smoothing Methods Part - 1 (00-09-54).csv --> 4.6
- 4 - 7 - 3.5 Smoothing Methods Part - 2 (00-13-17).csv --> 4.7

#### Week 5
- 4 - 8 - 3.6 Feedback in Text Retrieval (00-06-49).csv --> 5.1
- 4 - 9 - 3.7 Feedback in Vector Space Model- Rocchio (00-12-05).csv --> 5.2
- 4 - 10 - 3.8 Feedback in Text Retrieval- Feedback in LM (00-19-11).csv --> 5.3
- 5 - 1 - 4.1 Web Search- Introduction & Web Crawler (00-11-05).csv --> 5.4
- 5 - 2 - 4.2 Web Indexing (00-17-19).csv --> 5.5
- 5 - 3 - 4.3 Link Analysis - Part 1 (00-09-16).csv --> 5.6
- 5 - 4 - 4.3 Link Analysis - Part 2 (00-17-30).csv --> 5.7
- 5 - 5 - 4.3 Link Analysis - Part 3 (00-05-59).csv --> 5.8

#### Week 6
- 5 - 6 - 4.4 Learning to Rank Part 1 (00-13-09).csv --> 6.1
- 5 - 7 - 4.4 Learning to Rank - Part 2 (00-05-54).csv --> 6.2
- 5 - 8 - 4.4 Learning to Rank - Part 3 (00-04-58).csv --> 6.3
- 5 - 9 - 4.5 Future of Web Search (00-13-09).csv --> 6.4
- 5 - 10 - 4.6 Recommender Systems- Content-based Filtering - Part 1 (00-12-55).csv --> 6.5
- 5 - 11 - 4.6 Recommender Systems- Content-based Filtering - Part 2 (00-10-42).csv --> 6.6
- 5 - 12 - 4.7 Recommender Systems- Collaborative Filtering - Part 1 (00-06-20).csv --> 6.7
- 5 - 13 - 4.7 Recommender Systems- Collaborative Filtering - Part 2 (00-12-09).csv --> 6.8
- 5 - 14 - 4.7 Recommender Systems- Collaborative Filtering - Part 3 (00-04-45).csv --> 6.9

#### Week 7
- 2 - 1 - 1.1 Overview Text Mining and Analytics- Part 1 (00-11-43).csv --> 7.1
- 2 - 2 - 1.2 Overview Text Mining and Analytics- Part 2 (00-11-44).csv --> 7.2
- 2 - 3 - 1.3 Natural Language Content Analysis- Part 1 (00-12-48).csv --> 7.3
- 2 - 4 - 1.4 Natural Language Content Analysis- Part 2 (00-04-25).csv --> 7.4
- 2 - 5 - 1.5 Text Representation- Part 1 (00-10-46).csv --> 7.5
- 2 - 6 - 1.6 Text Representation- Part 2 (00-09-29).csv --> 7.6
- 2 - 7 - 1.7 Word Association Mining and Analysis (00-15-39).csv --> 7.7
- 2 - 8 - 1.8 Paradigmatic Relation Discovery Part 1 (00-14-31).csv --> 7.8
- 2 - 9 - 1.9 Paradigmatic Relation Discovery Part 2 (00-17-53).csv --> 7.9

#### Week 8
- 2 - 10 - 1.10 Syntagmatic Relation Discovery- Entropy (00-11-00).csv --> 8.1
- 2 - 11 - 1.11 Syntagmatic Relation Discovery- Conditional Entropy (00-11-57).csv --> 8.2
- 2 - 12 - 1.12 Syntagmatic Relation Discovery- Mutual Information- Part 1 (00-13-55).csv --> 8.3
- 2 - 13 - 1.13 Syntagmatic Relation Discovery- Mutual Information- Part 2 (00-09-42).csv --> 8.4
- 3 - 1 - 2.1 Topic Mining and Analysis- Motivation and Task Definition (00-07-36).csv --> 8.5
- 3 - 2 - 2.2 Topic Mining and Analysis- Term as Topic (00-11-31).csv --> 8.6
- 3 - 3 - 2.3 Topic Mining and Analysis- Probabilistic Topic Models (00-14-17).csv --> 8.7
- 3 - 4 - 2.4 Probabilistic Topic Models- Overview of Statistical Language Models- Part 1 (00-10-25).csv --> 8.8
- 3 - 5 - 2.5 Probabilistic Topic Models- Overview of Statistical Language Models- Part 2 (00-13-11).csv --> 8.9
- 3 - 6 - 2.6 Probabilistic Topic Models- Mining One Topic (00-12-21).csv --> 8.10

#### Week 9
- 3 - 7 - 2.7 Probabilistic Topic Models- Mixture of Unigram Language Models (00-12-39).csv --> 9.1
- 3 - 8 - 2.8 Probabilistic Topic Models- Mixture Model Estimation- Part 1 (00-10-16).csv --> 9.2
- 3 - 9 - 2.9 Probabilistic Topic Models- Mixture Model Estimation- Part 2 (00-08-15).csv --> 9.3
- 3 - 10 - 2.10 Probabilistic Topic Models- Expectation-Maximization Algorithm- Part 1 (00-11-05).csv --> 9.4
- 3 - 11 - 2.11 Probabilistic Topic Models- Expectation-Maximization Algorithm- Part 2 (00-10-39).csv --> 9.5
- 3 - 12 - 2.12 Probabilistic Topic Models- Expectation-Maximization Algorithm- Part 3 (00-06-25).csv --> 9.6
- 3 - 13 - 2.13 Probabilistic Latent Semantic Analysis (PLSA)- Part 1 (00-10-38).csv --> 9.7
- 3 - 14 - 2.14 Probabilistic Latent Semantic Analysis (PLSA)- Part 2 (00-10-15).csv --> 9.8
- 3 - 15 - 2.15 Latent Dirichlet Allocation (LDA)- Part 1 (00-10-20).csv --> 9.9
- 3 - 16 - 2.16 Latent Dirichlet Allocation (LDA)- Part 2 (00-12-03).csv --> 9.10

#### Week 10
- 4 - 1 - 3.1 Text Clustering- Motivation (00-15-52).csv --> 10.1
- 4 - 2 - 3.2 Text Clustering- Generative Probabilistic Models Part 1 (00-16-18).csv --> 10.2
- 4 - 3 - 3.3 Text Clustering- Generative Probabilistic Models Part 2 (00-08-37).csv --> 10.3
- 4 - 4 - 3.4 Text Clustering- Generative Probabilistic Models Part 3 (00-14-55).csv --> 10.4
- 4 - 5 - 3.5 Text Clustering- Similarity-based Approaches (00-17-48).csv --> 10.5
- 4 - 6 - 3.6 Text Clustering- Evaluation (00-10-11).csv --> 10.6
- 4 - 7 - 3.7 Text Categorization- Motivation (00-14-37).csv --> 10.7
- 4 - 8 - 3.8 Text Categorization- Methods (00-11-50).csv --> 10.8
- 4 - 9 - 3.9 Text Categorization- Generative Probabilistic Models (00-31-18).csv --> 10.9

#### Week 11
- 4 - 10 - 3.10 Text Categorization- Discriminative Classifier Part 1 (00-20-34).csv --> 11.1
- 4 - 11 - 3.11 Text Categorization- Discriminative Classifier Part 2 (00-31-46).csv --> 11.2
- 4 - 12 - 3.12 Text Categorization- Evaluation Part 1 (00-14-12).csv --> 11.3
- 4 - 13 - 3.13 Text Categorization- Evaluation Part 2 (00-10-51).csv --> 11.4
- 5 - 1 - 4.1 Opinion Mining and Sentiment Analysis- Motivation (00-17-51).csv --> 11.5
- 5 - 2 - 4.2 Opinion Mining and Sentiment Analysis- Sentiment Classification (00-11-47).csv --> 11.6
- 5 - 3 - 4.3 Opinion Mining and Sentiment Analysis- Ordinal Logistic Regression (00-13-43).csv --> 11.7

#### Week 12
- 5 - 4 - 4.4 Opinion Mining and Sentiment Analysis- Latent Aspect Rating Analysis Part 1 (00-15-17).csv --> 12.1
- 5 - 5 - 4.5 Opinion Mining and Sentiment Analysis- Latent Aspect Rating Analysis Part 2 (00-14-43).csv --> 12.2
- 5 - 6 - 4.6 Text-Based Prediction (00-12-08).csv --> 12.3
- 5 - 7 - 4.7 Contextual Text Mining- Motivation (00-06-47).csv --> 12.4
- 5 - 8 - 4.8 Contextual Text Mining- Contextual Probabilistic Latent Semantic Analysis (00-17-59).csv --> 12.5
- 5 - 9 - 4.9 Contextual Text Mining- Mining Topics with Social Network Context (00-14-43).csv --> 12.6
- 5 - 10 - 4.10 Contextual Text Mining- Mining Casual Topics with Time Series Supervision (00-19-37).csv --> 12.7
- 5 - 11 - 4.11 Course Summary (00-18-36).csv --> 12.8
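
The mapping lists above can be turned into a lookup table by parsing the bullets directly. The snippet below is a minimal sketch, not code from this repository: `MAPPING_RE` and `parse_mapping` are hypothetical names, and it assumes the bullets have been copied into a list of strings. Lecture numbers are kept as strings because `8.1` and `8.10` are different lectures and would collide if converted to floats.

```python
import re

# Matches "- <file>.csv --> <week>.<index>", tolerating stray spaces in the arrow.
MAPPING_RE = re.compile(
    r"^-\s*(?P<file>.+?\.csv)\s*--\s*>\s*(?P<lecture>\d+\.\d+)\s*$"
)


def parse_mapping(lines):
    """Build {lecture number: transcript file name} from mapping bullets."""
    mapping = {}
    for line in lines:
        match = MAPPING_RE.match(line.strip())
        if match:
            mapping[match.group("lecture")] = match.group("file")
    return mapping
```

With such a table, the lecture number taken from a question line selects the transcript `.csv` file to read in ``data/parsed/transcripts``.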


