UBC-NLP/MARBERT
UBC-NLP/MARBERT is a machine learning model.
About UBC-NLP/MARBERT
MARBERT is a large-scale pre-trained masked language model focused on both Dialectal Arabic (DA) and Modern Standard Arabic (MSA). We randomly sample 1B Arabic tweets from a large in-house dataset of about 6B tweets. We only include tweets with at least 3 Arabic words, based on character string matching, regardless of whether the tweet contains non-Arabic strings. The dataset makes up 128GB of text (15.6B tokens). We use the same network architecture as ARBERT (BERT-base), but without the next sentence prediction (NSP) objective, since tweets are short. See our repo for modifying BERT code to remove NSP. For,
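
The tweet filter described above (keep a tweet only if it has at least 3 Arabic words, even when non-Arabic strings are present) could look roughly like the sketch below. This is not the authors' code: the function names `count_arabic_words` and `keep_tweet` are hypothetical, and it assumes an "Arabic word" is a whitespace-delimited token containing at least one character from the basic Arabic Unicode block (U+0600 to U+06FF). The original matching rule may differ.

```python
import re

# Assumption: a token counts as an Arabic word if it contains at least one
# character from the basic Arabic Unicode block (U+0600 to U+06FF).
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")


def count_arabic_words(tweet: str) -> int:
    """Count whitespace-delimited tokens that contain an Arabic character."""
    return sum(1 for token in tweet.split() if ARABIC_CHAR.search(token))


def keep_tweet(tweet: str, min_arabic_words: int = 3) -> bool:
    """Return True if the tweet has at least `min_arabic_words` Arabic words.

    Non-Arabic tokens (Latin text, URLs, emoji) do not disqualify a tweet;
    they are simply not counted.
    """
    return count_arabic_words(tweet) >= min_arabic_words


if __name__ == "__main__":
    samples = [
        "مرحبا بكم في الموقع lol",  # four Arabic words plus one Latin token
        "hello مرحبا world",        # only one Arabic word
    ]
    for tweet in samples:
        print(keep_tweet(tweet), tweet)
```

With the default threshold, the first sample is kept and the second is dropped, since mixed-script tweets pass as long as the Arabic word count reaches the minimum.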