huggingface/transformers: v4.9.0: TensorFlow examples, CANINE, tokenizer training, ONNX rework

Zenodo (CERN European Organization for Nuclear Research), 2021

Abstract

v4.9.0: TensorFlow examples, CANINE, tokenizer training, ONNX rework

ONNX rework

This version introduces a new package, transformers.onnx, which can be used to export models to ONNX. Contrary to the previous implementation, this approach is meant as an easily extendable package where users may define their own ONNX configurations and export the models they wish to export.

    python -m transformers.onnx --model=bert-base-cased onnx/bert-base-cased/

    Validating ONNX model...
            -[✓] ONNX model outputs' name match reference model ({'pooler_output', 'last_hidden_state'}
            - Validating ONNX Model output "last_hidden_state":
                    -[✓] (2, 8, 768) matchs (2, 8, 768)
                    -[✓] all values close (atol: 0.0001)
            - Validating ONNX Model output "pooler_output":
                    -[✓] (2, 768) matchs (2, 768)
                    -[✓] all values close (atol: 0.0001)
    All good, model saved at: onnx/bert-base-cased/model.onnx

- [RFC] Laying down building stone for more flexible ONNX export capabilities #11786 (@mfuntowicz)

CANINE model

Four new models are released as part of the CANINE implementation: CanineForSequenceClassification, CanineForMultipleChoice, CanineForTokenClassification and CanineForQuestionAnswering, in PyTorch.

The CANINE model was proposed in CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation by Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting. It's among the first papers that train a Transformer without using an explicit tokenization step (such as Byte Pair Encoding (BPE), WordPiece, or SentencePiece). Instead, the model is trained directly at a Unicode character level. Training at a character level inevitably comes with a longer sequence length, which CANINE solves with an efficient downsampling strategy, before applying a deep Transformer encoder.

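To make the character-level idea concrete, here is a minimal sketch in plain Python that does not use the transformers library at all. It maps text to Unicode code points with the built-in ord, so no vocabulary file is needed, and shows how a downsampling step shortens the sequence seen by the deep encoder. The helper names to_codepoints and downsampled_length are hypothetical, and the rate of 4 is an illustrative assumption, not a value stated in these notes. The library's own tokenizer also handles special tokens and padding, which this sketch omits.

```python
import math
from typing import List


def to_codepoints(text: str) -> List[int]:
    """Map each character to its Unicode code point; no vocabulary is involved."""
    return [ord(ch) for ch in text]


def downsampled_length(num_chars: int, rate: int = 4) -> int:
    """Length of the sequence after downsampling by `rate`.

    `rate` is an illustrative assumption, not a value from the release notes.
    """
    return math.ceil(num_chars / rate)


text = "Tokenization-free h\u00e9llo"
ids = to_codepoints(text)

# Character-level input is much longer than a word-level split of the same text.
print(len(text.split()), "whitespace-separated words vs", len(ids), "characters")
print("first code points:", ids[:5])
print("length after downsampling:", downsampled_length(len(ids)))
```

The point of the sketch is the trade-off the paragraph describes: dropping the tokenizer makes the input one position per character, and downsampling brings the length back down before the expensive encoder layers run.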
- Add CANINE #12024 (@NielsRogge)

Compatible checkpoints can be found on the Hub: https://huggingface.co/models?filter=canine

Tokenizer training

This version introduces a new method to train a tokenizer from scratch based on an existing tokenizer configuration.

    from datasets import load_dataset
    from transformers import AutoTokenizer

    dataset = load_dataset("wikitext", name="wikitext-2-raw-v1", split="train")
    # We train on batches of texts, 1000 at a time here.
    batch_size = 1000
    corpus = (dataset[i : i + batch_size]["text"] for i in range(0, len(dataset), batch_size))

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    new_tokenizer = tokenizer.train_new_from_iterator(corpus, vocab_size=20000)

- Easily train a new fast tokenizer from a given one - tackle the special tokens format (str or AddedToken) #12420 (@SaulLu)
- Easily train a new fast tokenizer from a given one #12361 (@sgugger)

TensorFlow examples

The TFTrainer is now deprecated and is replaced by Keras. Version v4.9.0 completes a long rework of the TensorFlow examples to make them more Keras-idiomatic, clearer, and more robust.

- NER example for Tensorflow #12469 (@Rocketknight1)
- TF summarization example #12617 (@Rocketknight1)
- Adding TF translation example #12667 (@Rocketknight1)
- Deprecate TFTrainer #12706 (@Rocketknight1)

TensorFlow implementations

HuBERT is now implemented in TensorFlow:

- Add TFHubertModel #12206 (@will-rice)

Breaking changes

When load_best_model_at_end was set to True in the TrainingArguments, having a different save_strategy and eval_strategy was accepted, but the save_strategy was overwritten by the eval_strategy (the option to keep track of the best model needs to make sure there is an evaluation each time there is a save).
This led to a lot of confusion, with users not understanding why the script was not doing what it was told. This situation now raises an error asking the user to set save_strategy and eval_strategy to the same value; when that value is "steps", save_steps must be a round multiple of eval_steps.

General improvements and bugfixes

- UpdateDescription of TrainingArgs param save_strategy #12328 (@sam-qordoba)
- [Deepspeed] new docs #12077 (@stas00)
- [ray] try fixing import error #12338 (@richardliaw)
- [examples/Flax] move the examples table up #12341 (@patil-suraj)
- Fix torchscript tests #12336 (@LysandreJik)
- Add flax/jax quickstart #12342 (@marcvanzee)
- Fixed a typo in readme #12356 (@MichalPitr)
- Fix exception in prediction loop occurring for certain batch sizes #12350 (@jglaser)
- Add FlaxBigBird QuestionAnswering script #12233 (@vasudevgupta7)
- Replace NotebookProgressReporter by ProgressReporter in Ray Tune run #12357 (@krfricke)
- [examples] remove extra white space from log format #12360 (@stas00)
- fixed multiplechoice tokenization #12362 (@cronoik)
- [trainer] add main_process_first context manager #12351 (@stas00)
- [Examples] Replicates the new --log_level feature to all trainer-based pytorch #12359 (@bhadreshpsavani)
- [Examples] Update Example Template for --log_level feature #12365 (@bhadreshpsavani)
- [Examples] Replace print statement with logger.info in QA example utils #12368 (@bhadreshpsavani)
- Onnx export v2 fixes #12388 (@LysandreJik)
- [Documentation] Warn that DataCollatorForWholeWordMask is limited to BertTokenizer-like tokenizers #12371 (@ionicsolutions)
- Update run_mlm.py #12344 (@TahaAslani)
- Add possibility to maintain full copies of files #12312 (@sgugger)
- [CI] add dependency table sync verification #12364 (@stas00)
- [Examples] Added context manager to datasets map #12367 (@bhadreshpsavani)
- [Flax community event] Add more description to readme #12398 (@patrickvonplaten)
- Remove the need for einsum in Albert's attention computation #12394 (@mfuntowicz)
- [Flax] Adapt flax examples to include push_to_hub #12391 (@patrickvonplaten)
- Tensorflow LM examples #12358 (@Rocketknight1)
- [Deepspeed] match the trainer log level #12401 (@stas00)
- [Flax] Add T5 pretraining script #12355 (@patrickvonplaten)
- [models] respect dtype of the model when instantiating it #12316 (@stas00)
- Rename detr targets to labels #12280 (@NielsRogge)
- Add out of vocabulary error to ASR models #12288 (@will-rice)
- Fix TFWav2Vec2 SpecAugment #12289 (@will-rice)
- [example/flax] add summarization readme #12393 (@patil-suraj)
- [Flax] Example scripts - correct weight decay #12409 (@patrickvonplaten)
- fix ids_to_tokens naming error in tokenizer of deberta v2 #12412 (@hjptriplebee)
- Minor fixes in original RAG training script #12395 (@shamanez)
- Added talks #12415 (@suzana-ilic)
- [modelcard] fix #12422 (@stas00)
- Add option to save on each training node #12421 (@sgugger)
- Added to talks section #12433 (@suzana-ilic)
- Fix default bool in argparser #12424 (@sgugger)
- Add default bos_token and eos_token for tokenizer of deberta_v2 #12429 (@hjptriplebee)
- fix typo in mt5 configuration docstring #12432 (@fcakyon)
- Add to talks section #12442 (@suzana-ilic)
- [JAX/Flax readme] add philosophy doc #12419 (@patil-suraj)
- [Flax] Add wav2vec2 #12271 (@patrickvonplaten)
- Add test for a WordLevel tokenizer model #12437 (@SaulLu)
- [Flax community event] How to use hub during training #12447 (@patrickvonplaten)
- [Wav2Vec2, Hubert] Fix ctc loss test #12458 (@patrickvonplaten)
- Comment fast GPU TF tests #12452 (@LysandreJik)
- Fix training_args.py barrier for torch_xla #12464 (@jysohn23)
- Added talk details #12465 (@suzana-ilic)
- Add TPU README #12463 (@patrickvonplaten)
- Import check_inits handling of duplicate definitions. #12467 (@Iwontbecreative)
- Validation split added: custom data files @sgugger, @patil-suraj #12407 (@Souvic)
- Fixing bug with param count without embeddings #12461 (@TevenLeScao)
- [roberta] fix lm_head.decoder.weight ignore_key handling #12446 (@stas00)
- Rework notebooks and move them to the Notebooks repo #12471 (@sgugger)
- fixed typo in flax-projects readme #12466 (@mplemay)
- Fix TAPAS test uncovered by #12446 #12480 (@LysandreJik)
- Add guide on how to build demos for the Flax sprint #12468 (@osanseviero)
- Add Repository import to the FLAX example script #12501 (@LysandreJik)
- [examples/flax] clip style image-text training example #12491 (@patil-suraj)
- [Flax] Fix wav2vec2 pretrain arguments #12498 (@Wikidepia)
- [Flax] ViT training example #12300 (@patil-suraj)
- Fix order of state and input in Flax Quickstart README #12510 (@navjotts)
- [Flax] Dataset streaming example #12470 (@patrickvonplaten)
- [Flax] Correct flax training scripts #12514 (@patrickvonplaten)
- [Flax] Correct logging steps flax #12515 (@patrickvonplaten)
- [Flax] Fix another bug in logging steps #12516 (@patrickvonplaten)
- [Wav2Vec2] Flax - Adapt wav2vec2 script #12520 (@patrickvonplaten)
- [Flax] Fix hybrid clip #12519 (@patil-suraj)
- [RoFormer] Fix some issues #12397 (@JunnYu)
- FlaxGPTNeo #12493 (@patil-suraj)
- Updated README #12540 (@suzana-ilic)
- Edit readme #12541 (@SaulLu)
- implementing tflxmertmodel integration test #12497 (@sadakmed)
- [Flax] Adapt examples to be able to use eval_steps and save_steps #12543 (@patrickvonplaten)
- [examples/flax] add adafactor optimizer #12544 (@patil-suraj)
- [Flax] Add FlaxMBart #12236 (@stancld)
- Add a warning for broken ProphetNet fine-tuning #12511 (@JetRunner)
- [trainer] add option to ignore keys for the train function too (#11719) #12551 (@shabie)
- MLM training fails with no validation file(same as #12406 for pytorch now) #12517 (@Souvic)
- [Flax] Allow retraining from save checkpoint #12559 (@patrickvonplaten)
- Adding prepare_decoder_input_ids_from_labels methods to all TF ConditionalGeneration models #12560 (@Rocketknight1)
- Remove tf.roll wherever not needed #12512 (@szutenberg)
- Double check for attribute num_examples #12562 (@sgugger)
- [examples/hybrid_clip] fix loading clip vision model #12566 (@patil-suraj)
- Remove logging of GPU count etc from run_t5_mlm_flax.py #12569 (@ibraheem-moosa)
- raise exception when argu