# huggingface/transformers: New tokenizer API, TensorFlow improvements, enhanced documentation & tutorials
## New Tokenizer API (@n1t0, @thomwolf, @mfuntowicz)

The tokenizers have evolved quickly over the course of version 2, with the addition of Rust tokenizers. They now have a simpler and more flexible API, aligned between the Python (slow) and Rust (fast) tokenizers. The new API gives you deeper control over truncation and padding, allowing things like dynamic padding or padding to a multiple of 8 (see the usage sketch below).

The redesigned API is explained in detail in #4510 and here: https://huggingface.co/transformers/master/preprocessing.html

Notable changes:

- Possibility to create NumPy tensors when using the `return_tensors` parameter on tokenizers.
- Introduced a new enum `TensorType` to map all the possible tensor backends we support: `TensorType.TENSORFLOW`, `TensorType.PYTORCH`, `TensorType.NUMPY`.
- Tokenizers now accept the `TensorType` enum in the `encode(...)`, `encode_plus(...)` and `batch_encode_plus(...)` methods for the `return_tensors` parameter.
- The new `BatchEncoding` property `is_fast` indicates whether the `BatchEncoding` comes from a Python (slow) tokenizer or a Rust (fast) tokenizer.
- `BatchEncoding` is now picklable.

Several PRs to make the API more stable have been made:

- [tokenizers] Fix #5081 and improve backward compatibility #5125 (@thomwolf)
- Tokenizers API developments #5103 (@thomwolf)
- More clear error message in the use-case of #5169 (@thomwolf)
- Add more tests on tokenizers serialization - fix bugs #5056 (@thomwolf)
- [Tokenization] Fix #5181 - make #5155 more explicit - move back the default logging level in tests to WARNING #5252 (@thomwolf)
- [tokenizers] Several small improvements and bug fixes #5287
- Add pad_to_multiple_of on tokenizers (reimport) #5054 (@mfuntowicz)
- [tokenizers] Updates data processors, docstring, examples and model cards to the new API #5308

## TensorFlow improvements (@jplu, @dzorlu, @LysandreJik)

Very big release for TensorFlow! A sketch of the two headline features follows below.

- TensorFlow models can now compute the loss themselves, using the `TFPreTrainedModel.compute_loss` method. #4530
- Token embeddings can now be resized in TensorFlow. #4351
- Cleaning TensorFlow models #5229

## Enhanced documentation (@sgugger)

We welcome @sgugger as a team member in New York; he introduces a lot of very cool documentation changes:

- Added a model summary #4789
- Expose classes used in documentation #4808
- Explain how to preview the docs in a PR #4795
- Clean documentation #4849
- Remove old doc page and add note about cache in installation #5027
- Fix all Sphinx warnings #5068 (@sgugger)
- Update pipeline examples to doctest syntax #5030
- Reorganize documentation #5064
- Update installation page and add contributing to the doc #5084
- Update glossary #5148
- Quick tour #5145
- Switch master/stable doc and add older releases #5193
- Add version control menu #5222
- Don't recreate old docs #5243
- Tokenization tutorial #5257
- Remove links for all docs #5280
- New model sharing tutorial #5323

## Training & fine-tuning quickstart

Our own @joeddav added a training & fine-tuning quickstart to the documentation #5034!

## MobileBERT

The MobileBERT model from *MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices* by Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang and Denny Zhou was added to the library, for both PyTorch and TensorFlow.

A single checkpoint is added: `mobilebert-uncased`, which is the `uncased_L-24_H-128_B-512_A-4_F-4_OPT` checkpoint converted to our API.

This model was first implemented in PyTorch by @lonePatient, ported to the library by @vshampor, and finalized and added alongside the TensorFlow version by @LysandreJik.
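Loading MobileBERT follows the usual pattern; a minimal sketch, using the checkpoint name given above (on the current model hub the same weights live under `google/mobilebert-uncased`):

```python
from transformers import MobileBertModel, MobileBertTokenizer, TFMobileBertModel

tokenizer = MobileBertTokenizer.from_pretrained("mobilebert-uncased")
model = MobileBertModel.from_pretrained("mobilebert-uncased")       # PyTorch
tf_model = TFMobileBertModel.from_pretrained("mobilebert-uncased")  # TensorFlow
```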
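To make the new tokenizer API described at the top of these notes concrete, here is a minimal sketch of the unified padding/truncation controls. The `bert-base-uncased` checkpoint is only an example, and we assume `TensorType` is importable from the package top level (the equivalent `"pt"` string works regardless):

```python
import pickle

from transformers import AutoTokenizer, TensorType

# use_fast=True selects the Rust-backed (fast) tokenizer when one exists
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

sentences = ["Hello world!", "Padding and truncation are now controlled per call."]

# Dynamic padding to the longest sequence in the batch, rounded up to a
# multiple of 8 (handy for mixed-precision tensor cores), as PyTorch tensors.
batch = tokenizer.batch_encode_plus(
    sentences,
    padding=True,
    truncation=True,
    max_length=32,
    pad_to_multiple_of=8,
    return_tensors=TensorType.PYTORCH,  # the "pt" string still works
)

print(batch.is_fast)             # True for a Rust (fast) tokenizer
print(batch["input_ids"].shape)  # e.g. torch.Size([2, 16])

# BatchEncoding is now picklable
restored = pickle.loads(pickle.dumps(batch))
```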
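On the TensorFlow side, a sketch of the two headline changes, assuming `TFBertForSequenceClassification` and `bert-base-uncased` purely as examples: passing `labels` makes the model compute its loss itself, and token embeddings can now be resized after adding tokens:

```python
import tensorflow as tf

from transformers import BertTokenizerFast, TFBertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")

inputs = tokenizer.batch_encode_plus(
    ["great movie", "terrible movie"], padding=True, return_tensors="tf"
)
labels = tf.constant([1, 0])

# With labels provided, the first output is the loss the model computed
# itself (via TFPreTrainedModel.compute_loss under the hood)
outputs = model(inputs, labels=labels)
loss, logits = outputs[0], outputs[1]

# Token embeddings can now be resized on the TensorFlow side as well
tokenizer.add_tokens(["<special>"])
model.resize_token_embeddings(len(tokenizer))
```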
## Eli5 examples (@yjernite) #4968

- The examples/eli5 folder contains training code for the dense retriever and for fine-tuning a BART model, the Jupyter notebook for the blog post, and the code for the live demo.
- The RetriBert model implements the dense passage retriever. It's basically a wrapper for two BERT models and projection matrices, but it does gradient checkpointing in a way that is very different from a concurrent PR, so it was easier to give it its own class for now and see whether the two can be merged later.
- The BART files are only modified to add a reference to the ELI5 fine-tuned model on the model repo.

## Enhanced examples/seq2seq (@sshleifer)

- The examples/seq2seq folder is a combination of the old examples/summarization and examples/translation folders.
- Finetuning works well for summarization; more experiments are needed for translation. Summarization finetuning is much improved: it now works on multi-GPU, saves ROUGE scores during validation, and provides `--freeze_encoder` and `--freeze_embeds` options to accelerate finetuning. Distillation only supports summarization.
- Evaluation works well for both summarization and translation.
- New Weights & Biases shared task for collaboration on the XSUM summarization task.

## Distilbart (@sshleifer)

- Distilbart models are smaller versions of bart-large-cnn and bart-large-xsum. They can be loaded using `BartForConditionalGeneration.from_pretrained('sshleifer/distilbart-xsum-12-6')`, for example (see the generation sketch below).
- See this tweet for more info on available models and their speed/performance.
- Commands to reproduce are available in the examples/seq2seq folder.

## BERT Loses Patience (@JetRunner)

Add BERT Loses Patience (Patience-based Early Exit) based on the paper https://arxiv.org/abs/2006.04152 and the official implementation https://github.com/JetRunner/PABEE.

## Unifying label arguments (@sgugger) #4722

Deprecate any argument that's not `labels` (like `masked_lm_labels`, `lm_labels`, etc.) in favor of `labels`; a before/after sketch follows below.

## NumPy type in tokenizers (@mfuntowicz) #4585

Introduce a new tensor type for `return_tensors` on tokenizers: NumPy. As we're introducing more than two tensor backends, there is now an enum `TensorType` listing all the possible tensors we can create: `TensorType.TENSORFLOW`, `TensorType.PYTORCH`, `TensorType.NUMPY`. This should help newcomers who don't know about the `"tf"` and `"pt"` strings.

Note: `TensorType` values are compatible with the previous `"tf"` and `"pt"` strings, and now also `"np"`, for backward compatibility (+ unit test). NumPy is now a possible target when creating tensors; this is useful for JAX.
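The NumPy target in action; the checkpoint is again just an example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# return_tensors="np" (or, equivalently, TensorType.NUMPY) yields NumPy
# arrays instead of framework tensors, which is handy for JAX pipelines
encoded = tokenizer.encode_plus(
    "NumPy output is handy for JAX pipelines.", return_tensors="np"
)
print(type(encoded["input_ids"]))  # <class 'numpy.ndarray'>
```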
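A before/after sketch of the unified `labels` argument, with `BertForMaskedLM` and `bert-base-uncased` as stand-ins for any model and checkpoint:

```python
from transformers import AutoTokenizer, BertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

input_ids = tokenizer.encode("The capital of France is [MASK].", return_tensors="pt")
# Toy targets: a real MLM setup would set non-masked positions to -100
targets = input_ids.clone()

# Before v3: loss = model(input_ids, masked_lm_labels=targets)[0]  # deprecated
# From v3 on, every head uses the same argument name:
loss = model(input_ids, labels=targets)[0]
```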
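A quick generation sketch for the Distilbart checkpoints named above; the article is a placeholder and the generation parameters are just reasonable defaults, not prescribed values:

```python
from transformers import BartForConditionalGeneration, BartTokenizer

model = BartForConditionalGeneration.from_pretrained("sshleifer/distilbart-xsum-12-6")
tokenizer = BartTokenizer.from_pretrained("sshleifer/distilbart-xsum-12-6")

article = "..."  # placeholder: the long document to summarize

inputs = tokenizer.batch_encode_plus(
    [article], truncation=True, max_length=1024, return_tensors="pt"
)
summary_ids = model.generate(
    inputs["input_ids"], num_beams=4, max_length=60, early_stopping=True
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```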
## Community notebooks

- Adding notebooks for fine-tuning #4732 (@abhimishra91):
  - Multi-class classification: using DistilBERT
  - Multi-label classification: using BERT
  - Summarization: using T5 - model tracking with WandB
- Speed up Fine-Tuning in Transformers with Dynamic Padding / Bucketing #5195 (@pommedeterresautee)
- How to use Benchmarks #5312 (@patrickvonplaten)

## Benchmarks (@patrickvonplaten)

The benchmark script was consolidated and some features were added (a usage sketch follows at the end of these notes):

- Adds the functionality to measure the following for TF and PT (#4912):
  - TensorFlow inference: CPU, GPU, GPU + XLA, GPU + eager mode, CPU + eager mode, TPU
  - PyTorch inference: CPU, CPU + torchscript, GPU, GPU + torchscript, GPU + mixed precision, Torch/XLA TPU
  - PyTorch training: CPU, GPU, GPU + mixed precision, Torch/XLA TPU
- [Benchmark] Add encoder decoder to benchmark and clean labels #4810
- [Benchmark] add tpu and torchscript for benchmark #4850
- [Benchmark] Extend Benchmark to all model type extensions #5241
- [Benchmarks] improve Example Plotter #5245

## Hidden states, attentions and cache

Before v3.0.0, the way to handle attentions, model hidden states, and whether to use the cache in models that have it for sequential decoding was to specify an argument in the configuration. In version v3.0.0, while we do maintain that argument for backwards compatibility, we introduce a new way of handling these through the forward and call methods (see the sketch at the end of these notes).

- Output attentions #4538 (@Bharat123rox)
- Output hidden states #4978 (@drjosephliu)
- Use cache #5194 (@patrickvonplaten)

## Revamped AutoModels (@patrickvonplaten)

The AutoModelWithLMHead encompasses all models with a language modeling head, making no distinction between causal, masked and seq2seq models. Three new auto models are added:

- `AutoModelForCausalLM` for autoregressive models
- `AutoModelForMaskedLM` for autoencoding models
- `AutoModelForSeq2SeqLM` for sequence-to-sequence models with a causal LM for the decoder

## New model & tokenizer architectures

- XLMRobertaForQuestionAnswering #4855 (@sgugger)
- ElectraForQuestionAnswering #4913 (@patil-suraj)
- Add AlbertForMultipleChoice #4959 (@sgugger)
- BartForQuestionAnswering #4908 (@patil-suraj)
- BartTokenizerFast #4878 (@patil-suraj)
- Add DistilBertForMultipleChoice #5032 (@sgugger)
- ElectraForMultipleChoice #4954 (@sgugger)

## ONNX

- Fixed a bug causing invalid ordering of the inputs in the underlying ONNX IR.
- Increased logging to give the user more information about the exported variables.

## BREAKING CHANGES SINCE v2

- In #4874 the language modeling BERT has been split in two: `BertForMaskedLM` and `BertLMHeadModel`. `BertForMaskedLM` therefore cannot do causal language modeling anymore, and cannot accept the `lm_labels` argument (see the sketch at the end of these notes).
- The Trainer data collator is now a method instead of a class.

## Bug fixes and improvements

- TFRobertaModelIntegrationTest requires tf #4726 (@sshleifer)
- Cleanup glue for TPU #4621 (@jysohn23)
- [Reformer] Improved memory if input is shorter than chunk length #4720 (@patrickvonplaten)
- Pipelines: miscellanea of QoL improvements and small features #4632 (@julien-c)
- Fix bug when changing the `<EOS>` token for generate #4745 (@patrickvonplaten)
- never_split on slow tokenizers should not split #4723 (@mfuntowicz)
- PretrainedModel.generate: remove unused kwargs #4761 (@sshleifer)
- Codecov is now set up differently to have better insights into code coverage #4768 (@LysandreJik)
- Don't access pad_token_id if t
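To go with the Benchmarks section above, a minimal sketch of the consolidated utilities on the PyTorch side; the model identifier and measurement sizes are arbitrary examples:

```python
from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

args = PyTorchBenchmarkArguments(
    models=["bert-base-uncased"],  # any model identifiers from the hub
    batch_sizes=[8],
    sequence_lengths=[128],
)
benchmark = PyTorchBenchmark(args)
results = benchmark.run()  # measures inference speed and memory
```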
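For the "Hidden states, attentions and cache" section above, a sketch of the new per-call arguments, using `BertModel` and `bert-base-uncased` as examples:

```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

input_ids = tokenizer.encode("Hello world", return_tensors="pt")

# Request attentions and hidden states for this call only, instead of
# setting output_attentions/output_hidden_states in the configuration
outputs = model(input_ids, output_attentions=True, output_hidden_states=True)
sequence_output, pooled_output = outputs[0], outputs[1]
all_hidden_states, all_attentions = outputs[2], outputs[3]

# Models with a cache for sequential decoding (e.g. GPT-2) take use_cache
# the same way on their forward/call methods.
```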
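The revamped auto classes map onto the three language-modeling families as below; the checkpoints are just representative examples:

```python
from transformers import (
    AutoModelForCausalLM,
    AutoModelForMaskedLM,
    AutoModelForSeq2SeqLM,
)

causal_lm = AutoModelForCausalLM.from_pretrained("gpt2")                    # autoregressive
masked_lm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")       # autoencoding
seq2seq_lm = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")   # encoder-decoder
```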
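Finally, a sketch of the BERT split called out under the breaking changes; note the `is_decoder` flag, which (as far as we can tell) the standalone causal variant requires:

```python
from transformers import BertConfig, BertForMaskedLM, BertLMHeadModel

# Masked LM only: lm_labels is gone, use labels (see "Unifying label arguments")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Causal LM now lives in its own class and wants a decoder configuration
config = BertConfig.from_pretrained("bert-base-uncased", is_decoder=True)
clm = BertLMHeadModel.from_pretrained("bert-base-uncased", config=config)
```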