Existing open-source helpfulness preference datasets do not specify what makes some responses more helpful and others less so. Models trained on these datasets can incidentally learn to model dataset artifacts (e.g. preferring longer but unhelpful responses only due to their length). To alleviate this problem, we collect HelpSteer, a multi-attribute helpfulness dataset annotated for the various aspects that make responses helpful. Specifically, our 37k-sample dataset has annotations for correctness, coherence, complexity, and verbosity in addition to overall helpfulness of responses. Training Llama 2 70B using the HelpSteer dataset with SteerLM technique produces a model that scores 7.54 on MT Bench, which is currently the highest score for open models that do not require training data from more powerful models (e.g. GPT-4). We release this dataset with CC-BY-4.0 license at https://huggingface.co/datasets/nvidia/HelpSteer
Parameter efficient learning methods (PERMs)have recently gained significant attention asthey provide an efficient way for pre-trainedlanguage models (PLMs) to adapt to a downstream task. However, these conclusions aremostly drawn from in-domain evaluations overthe full training set. In this paper, we presentcomparisons between PERMs and finetuningfrom three new perspectives: (1) the effect ofsample and model size to in-domain evaluations, (2) generalization to unseen domains andnew datasets, and (3) the faithfulness of generations. Our results show that for in-domainsettings (a) there is a cross point of samplesize for which PERMs will perform better thanfinetuning when training with fewer samples,and (b) larger PLMs have larger cross points.For cross-domain and cross-dataset cases, weshow that (a) Adapter (Houlsby et al., 2019)performs the best amongst all the PERMs studied here, and (b) it outperforms finetuning ifthe task dataset is below a certain size. Wealso compare the faithfulness of generationsand show that PERMs can achieve better faithfulness score than finetuning, especially forsmall training set, by as much as 6%. Finally,we apply Adapter to MT-NLG 530b (Smithet al., 2022) and achieve new state-of-the-artresults on Xsum (Narayan et al., 2018) for allROUGE scores (ROUGE-1 49.17, ROUGE-227.20, ROUGE-L 40.98).
This paper provides an overview of NVIDIA NeMo’s neural machine translation systems for the constrained data track of the WMT21 News and Biomedical Shared Translation Tasks. Our news task submissions for English-German (En-De) and English-Russian (En-Ru) are built on top of a baseline transformer-based sequence-to-sequence model (CITATION). Specifically, we use a combination of 1) checkpoint averaging 2) model scaling 3) data augmentation with backtranslation and knowledge distillation from right-to-left factorized models 4) finetuning on test sets from previous years 5) model ensembling 6) shallow fusion decoding with transformer language models and 7) noisy channel re-ranking. Additionally, our biomedical task submission for English ↔ Russian uses a biomedically biased vocabulary and is trained from scratch on news task data, medically relevant text curated from the news task dataset, and biomedical data provided by the shared task. Our news system achieves a sacreBLEU score of 39.5 on the WMT’20 En-De test set outperforming the best submission from last year’s task of 38.8. Our biomedical task Ru-En and En-Ru systems reach BLEU scores of 43.8 and 40.3 respectively on the WMT’20 Biomedical Task Test set, outperforming the previous year’s best submissions.