
[Pytorch][BERT] Understanding the BERT Source Code ② BertConfig

by Hyen4110 2022. 7. 5.

[Pytorch][BERT] Understanding the BERT Source Code: Table of Contents

BERT
  📑 BERT Config 👀
  📑 BERT Tokenizer
  📑 BERT Model
    📑 BERT Input
    📑 BERT Output
    📑 BERT Embedding
    📑 BERT Pooler
    📑 BERT Encoder
      📑 BERT Layer
        📑 BERT SelfAttention
        📑 BERT SelfOutput

 

 

BertConfig

<Source code> configuration_bert.py

from transformers import PretrainedConfig  # import added so the excerpt is self-contained

class BertConfig(PretrainedConfig):
    def __init__(
        self,
        vocab_size=30522,
        hidden_size=768,
        num_hidden_layers=12,
        num_attention_heads=12,
        intermediate_size=3072,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        max_position_embeddings=512,
        type_vocab_size=2,
        initializer_range=0.02,
        layer_norm_eps=1e-12,
        pad_token_id=0,
        position_embedding_type="absolute",
        use_cache=True,
        classifier_dropout=None,
        **kwargs
    ):
        super().__init__(pad_token_id=pad_token_id, **kwargs)
        # The remaining arguments are stored as same-named attributes,
        # e.g. self.vocab_size = vocab_size, self.hidden_size = hidden_size, ...
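
For reference, here is a short usage sketch (not part of the original source) showing how the config drives the model architecture. It assumes the Hugging Face transformers package is installed; the values chosen for the smaller config are arbitrary examples.

from transformers import BertConfig, BertModel

# The default config reproduces the bert-base architecture (12 layers, hidden size 768).
config = BertConfig()

# Overriding a few fields gives a smaller, randomly initialized BERT;
# every architectural choice comes from these config values.
small_config = BertConfig(
    num_hidden_layers=4,
    hidden_size=256,
    num_attention_heads=4,
    intermediate_size=1024,
)
model = BertModel(small_config)

print(model.config.hidden_size)                     # 256
print(sum(p.numel() for p in model.parameters()))   # parameter count of the small model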

 

🐸 Parameters in the Config

| parameter | description | default |
| --- | --- | --- |
| vocab_size | Vocabulary size of BERT / the number of unique tokens | 30522 |
| hidden_size | Dimensionality of the encoder and pooler layers | 768 |
| num_hidden_layers | Number of hidden layers in the encoder | 12 |
| num_attention_heads | Number of attention heads in each encoder attention layer | 12 |
| intermediate_size | Dimensionality of the encoder's intermediate (feed-forward) layer | 3072 |
| hidden_act | Activation function of the encoder and pooler | "gelu" |
| hidden_dropout_prob | Dropout probability for the fully connected layers in the embeddings, encoder, and pooler | 0.1 |
| attention_probs_dropout_prob | Dropout ratio for the attention probabilities | 0.1 |
| max_position_embeddings | Maximum sequence length the model can handle | 512 |
| type_vocab_size | Vocabulary size of token_type_ids | 2 |
| initializer_range | Standard deviation used to initialize all weights | 0.02 |
| layer_norm_eps | Epsilon value used by the layer normalization layers | 1e-12 |
| position_embedding_type | Type of position embedding ("absolute", "relative_key", "relative_key_query") | "absolute" |
| use_cache | Whether the model returns the last key/value attentions (only meaningful when is_decoder=True) | True |
| classifier_dropout | Dropout ratio for the classification head | None |
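
A quick way to check these values against a real checkpoint is to load its config, as in the sketch below ("bert-base-uncased" is just an illustrative checkpoint name):

from transformers import BertConfig

config = BertConfig.from_pretrained("bert-base-uncased")
print(config.vocab_size)               # 30522
print(config.max_position_embeddings)  # 512: longer sequences cannot be encoded

# hidden_size must be divisible by num_attention_heads;
# each head works on hidden_size // num_attention_heads dimensions.
print(config.hidden_size // config.num_attention_heads)  # 64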

 

 

<English version>

| parameter | description | default |
| --- | --- | --- |
| vocab_size | Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling [BertModel] or [TFBertModel]. | 30522 |
| hidden_size | Dimensionality of the encoder layers and the pooler layer. | 768 |
| num_hidden_layers | Number of hidden layers in the Transformer encoder. | 12 |
| num_attention_heads | Number of attention heads for each attention layer in the Transformer encoder. | 12 |
| intermediate_size | Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder. | 3072 |
| hidden_act | The non-linear activation function (function or string) in the encoder and pooler. If a string, "gelu", "relu", "silu" and "gelu_new" are supported. | "gelu" |
| hidden_dropout_prob | The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. | 0.1 |
| attention_probs_dropout_prob | The dropout ratio for the attention probabilities. | 0.1 |
| max_position_embeddings | The maximum sequence length that this model might ever be used with. Typically set this to something large just in case. | 512 |
| type_vocab_size | The vocabulary size of the token_type_ids passed when calling [BertModel] or [TFBertModel]. | 2 |
| initializer_range | The standard deviation of the truncated_normal_initializer for initializing all weight matrices. | 0.02 |
| layer_norm_eps | The epsilon used by the layer normalization layers. | 1e-12 |
| position_embedding_type | Type of position embedding. Choose one of "absolute", "relative_key", "relative_key_query". For positional embeddings use "absolute". | "absolute" |
| use_cache | Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True. | True |
| classifier_dropout | The dropout ratio for the classification head. | None |
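
A side note on classifier_dropout: when it is left as None, downstream classification heads (e.g. BertForSequenceClassification) fall back to hidden_dropout_prob. The snippet below only illustrates that fallback logic; it is a sketch, not the library source itself.

from transformers import BertConfig

config = BertConfig(classifier_dropout=None, hidden_dropout_prob=0.1)

# Fallback used by classification heads: prefer classifier_dropout,
# otherwise reuse hidden_dropout_prob.
classifier_dropout = (
    config.classifier_dropout
    if config.classifier_dropout is not None
    else config.hidden_dropout_prob
)
print(classifier_dropout)  # 0.1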
