huggingface/transformers: CTRL, DistilGPT-2, Pytorch TPU, tokenizer enhancements, guideline requirements
## New model architectures: CTRL, DistilGPT-2

Two new models have been added since release 2.0:

- CTRL (from Salesforce), released with the paper *CTRL: A Conditional Transformer Language Model for Controllable Generation* by Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong and Richard Socher. This model was added to the library by @keskarnitish with the help of @thomwolf.
- DistilGPT-2 (from HuggingFace), the second distilled model after DistilBERT in version 1.2.0, released alongside the paper *DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter*.

A minimal loading sketch for both models appears at the end of these notes.

## Distillation

Several updates have been made to the distillation script, including the ability to distill GPT-2 and to distill on the SQuAD task. By @VictorSanh.

## PyTorch TPU support

The `run_glue.py` example script can now run on a PyTorch TPU (see the device sketch at the end of these notes).

## Updates to example scripts

Several example scripts have been improved and refactored to take full advantage of the new tokenizer functions:

- `run_multiple_choice.py` has been refactored to use `encode_plus`, by @julien-c and @erenup
- `run_lm_finetuning.py` has been improved with the help of @dennymarcels, @jinoobaek-qz and @LysandreJik
- `run_glue.py` has been improved with the help of @brian41005

## QOL enhancements on the tokenizer

Several enhancements have been made to the tokenizers. Two new methods have been added: `get_special_tokens_mask` and `truncate_sequences`. The former returns a mask indicating which tokens in a token list are special tokens and which come from the initial sequences; the latter truncates sequences according to a strategy.

Both of these methods are called by the `encode_plus` method, which is itself called by the `encode` method. `encode_plus` now returns a larger dictionary holding information about the special tokens, as well as the overflowing tokens. Thanks to @julien-c, @thomwolf and @LysandreJik for these additions.

## Breaking changes

- The two methods `add_special_tokens_single_sequence` and `add_special_tokens_sequence_pair` have been removed. They are replaced by the single method `build_inputs_with_special_tokens`, which has a more comprehensible name and handles both single sequences and sequence pairs.
- The boolean parameter `truncate_first_sequence` has been removed from the tokenizers' `encode` and `encode_plus` methods. It is replaced by a strategy in the form of a string: `'longest_first'`, `'only_second'`, `'only_first'` or `'do_not_truncate'` are the accepted strategies.
- When the `encode` or `encode_plus` methods are called with a specified `max_length`, the sequences are now always truncated, or an error is thrown if they overflow.

## Guidelines and requirements

New contributing guidelines have been added, alongside library development requirements, by @rlouf, the newest member of the HuggingFace team.

## Community additions/bug-fixes/improvements

- The GLUE processors have been refactored to handle inputs for all tasks coming from `tensorflow_datasets`. This work was done by @agrinh and @philipp-eisen.
- The `padding_idx` is now correctly initialized to 1 in randomly initialized RoBERTa models. (@ikuyamada)
- The documentation CSS has been adapted to work on older browsers. (@TimYagan)
- A note concerning the management of hidden states has been added to the README by @BramVanroy.
- Integration of TF 2.0 models with other Keras modules. (@thomwolf)
- Past values can be opted out. (@thomwolf)
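As a quick illustration of the two new model architectures above, here is a minimal loading sketch. It assumes the published checkpoint shortcuts `ctrl` and `distilgpt2` and a working PyTorch install; note that DistilGPT-2 is served through the existing GPT-2 classes rather than a dedicated class.

```python
import torch
from transformers import CTRLTokenizer, CTRLLMHeadModel, GPT2Tokenizer, GPT2LMHeadModel

# CTRL: a conditional language model steered by control codes.
ctrl_tokenizer = CTRLTokenizer.from_pretrained("ctrl")
ctrl_model = CTRLLMHeadModel.from_pretrained("ctrl")

# DistilGPT-2 reuses the regular GPT-2 tokenizer and model classes.
distil_tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
distil_model = GPT2LMHeadModel.from_pretrained("distilgpt2")

input_ids = torch.tensor([distil_tokenizer.encode("Hello, my dog is")])
logits = distil_model(input_ids)[0]  # next-token logits, shape (1, seq_len, vocab_size)
```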
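The PyTorch TPU support builds on the `torch_xla` package. The snippet below is a rough sketch of the device pattern involved, not the `run_glue.py` script itself; the toy linear model is a stand-in for the real classifier, and a TPU runtime with `torch_xla` installed is assumed.

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()                    # the TPU core, exposed as a torch device
model = torch.nn.Linear(8, 2).to(device)    # toy stand-in for the GLUE classifier
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(4, 8).to(device)
labels = torch.randint(0, 2, (4,)).to(device)

loss = torch.nn.functional.cross_entropy(model(inputs), labels)
loss.backward()
xm.optimizer_step(optimizer, barrier=True)  # optimizer step plus TPU synchronization
```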
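To make the tokenizer enhancements concrete, here is a small sketch against the 2.1 API described above, with a BERT tokenizer standing in for any pretrained tokenizer. Later library versions gate some of these dictionary keys behind explicit `return_*` flags, so treat the key names as version-dependent.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# encode_plus builds the full model input for a sequence pair and reports
# which positions are special tokens, plus anything dropped by truncation.
encoded = tokenizer.encode_plus(
    "The quick brown fox",
    "jumps over the lazy dog",
    add_special_tokens=True,
    max_length=10,
)
print(encoded["input_ids"])
print(encoded["special_tokens_mask"])     # 1 = special token, 0 = sequence token
print(encoded.get("overflowing_tokens"))  # tokens removed to satisfy max_length

# get_special_tokens_mask can also be called on its own.
ids = tokenizer.encode("The quick brown fox", add_special_tokens=True)
print(tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True))
```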
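The new truncation strategies and `build_inputs_with_special_tokens` from the breaking-changes section can be exercised as follows. This is again a sketch against the 2.1 API; in later versions the `truncation_strategy` argument was superseded by `truncation`.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

question = "What does the fox do?"
context = "The quick brown fox jumps over the lazy dog."

# Truncation is now chosen by name instead of the removed boolean flag.
ids = tokenizer.encode(
    question,
    context,
    add_special_tokens=True,
    max_length=16,
    truncation_strategy="only_second",  # keep the question, truncate the context
)

# build_inputs_with_special_tokens handles singletons and pairs alike,
# replacing add_special_tokens_single_sequence / add_special_tokens_sequence_pair.
q_ids = tokenizer.encode(question, add_special_tokens=False)
c_ids = tokenizer.encode(context, add_special_tokens=False)
single = tokenizer.build_inputs_with_special_tokens(q_ids)       # [CLS] q [SEP]
pair = tokenizer.build_inputs_with_special_tokens(q_ids, c_ids)  # [CLS] q [SEP] c [SEP]
```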
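Finally, the refactored GLUE processors make it possible to feed `tensorflow_datasets` splits straight into the library's feature converter. A minimal sketch, assuming the `tensorflow_datasets` package is installed alongside TensorFlow:

```python
import tensorflow_datasets as tfds
from transformers import BertTokenizer, glue_convert_examples_to_features

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# Load MRPC through tensorflow_datasets and convert it to model features.
data = tfds.load("glue/mrpc")
train_dataset = glue_convert_examples_to_features(
    data["train"], tokenizer, max_length=128, task="mrpc"
)
```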