sarahlintang/IndoBERT
sarahlintang/IndoBERT is a machine learning model.
About sarahlintang/IndoBERT
IndoBERT is a pre-trained language model based on the BERT architecture for the Indonesian language. The model was trained using Google's original TensorFlow code on an eight-core Google Cloud TPU v2. It is equivalent to the bert-base model, with a vocabulary size of 32,000. It has been shown to outperform multilingual BERT on all downstream tasks. Training used a Google Cloud Storage bucket for persistent storage of training data and models. The model was trained on 16 GB of raw text (2 billion words) from the OSCAR corpus (https://oscar-corpus.com/). The training procedure has been,
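
To make the "bert-base with a 32,000-token vocabulary" description concrete, the sketch below estimates the encoder's parameter count. Only the vocabulary size comes from the description above. The other hyperparameters (hidden size 768, 12 layers, feed-forward size 3072, 512 positions, 2 segment types) are assumed from the standard bert-base configuration and are not stated here, and the function name is hypothetical. Pre-training heads are not counted.

```python
def bert_encoder_param_count(vocab_size=32_000, hidden=768, layers=12,
                             intermediate=3072, max_positions=512, type_vocab=2):
    """Estimate the parameter count of a BERT encoder (embeddings, layers, pooler).

    Defaults other than vocab_size are ASSUMED from the standard bert-base
    configuration; they are not given in the model description.
    """
    layer_norm = 2 * hidden  # scale and bias vectors

    # Token, position, and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Self-attention: query, key, value, and output projections (weights + biases),
    # followed by one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Feed-forward: up-projection and down-projection (weights + biases),
    # followed by one LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)

    # Pooler: one dense layer applied to the first token's representation.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_encoder_param_count()
print(f"{total:,}")  # 110,617,344
```

Under these assumptions the encoder has roughly 110.6 million parameters, of which about 24.6 million sit in the token embedding table (32,000 x 768). Changing `vocab_size` shows how much of the model size depends on the vocabulary alone.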