
[Pytorch][BERT] Understanding the BERT Source Code ⑤ BertEmbedding

by Hyen4110 2022. 7. 6.

[Pytorch][BERT] Understanding the BERT Source Code: Table of Contents

BERT
  📑 BERT Config
  📑 BERT Tokenizer
  📑 BERT Model
    📑 BERT Input
    📑 BERT Output
    📑 BERT Embedding 👀
    📑 BERT Pooler
    📑 BERT Encoder
      📑 BERT Layer
        📑 BERT SelfAttention
        📑 BERT SelfOutput

1. What is BertEmbedding?

: Creates the embeddings that are fed into BertEncoder

= WordPiece Embedding + Position Embedding + Segment Embedding

| Embedding | Description | Default |
| --- | --- | --- |
| WordPiece Embedding (= word_embeddings) | The word embedding that serves as the actual input | Vocabulary size: 30,522 |
| Position Embedding (= position_embeddings) | Embedding for learning positional information | Maximum sentence length: 512 |
| Segment Embedding (= token_type_embeddings) | Embedding for distinguishing two sentences | Maximum number of sentences: 2 |

https://wikidocs.net/115055
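
The three default sizes in the table above come straight from the Hugging Face BertConfig defaults. A minimal check, assuming the transformers library is installed:

from transformers import BertConfig

config = BertConfig()                   # default BERT-base configuration
print(config.vocab_size)                # 30522 -> WordPiece embedding table size
print(config.max_position_embeddings)   # 512   -> position embedding table size
print(config.type_vocab_size)           # 2     -> segment (token_type) embedding table size
print(config.hidden_size)               # 768   -> embedding dimension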

 

 

 

2. BertEmbedding components

2.1 word_embeddings

  • Maps the high-dimensional (vocab_size: 30,522) TF vector to a low-dimensional (hidden_size: 768) vector
  • Only word_embeddings is given the padding_idx argument, which maps the padding token to a zero vector, effectively masking it (see the sketch below)

self.word_embeddings = nn.Embedding(config.vocab_size, 
                                    config.hidden_size, 
                                    padding_idx=config.pad_token_id)

<Notes from the source code>

  • `vocab_size` (int, optional, defaults to 30522): Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling [BertModel] or [TFBertModel].
  • `hidden_size` (int, optional, defaults to 768): Dimensionality of the encoder layers and the pooler layer.
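
A minimal sketch of the padding_idx behaviour. The token ids below are only illustrative (in bert-base-uncased, 0 is [PAD], 101 is [CLS], and 102 is [SEP]):

import torch
from torch import nn

word_embeddings = nn.Embedding(30522, 768, padding_idx=0)

input_ids = torch.tensor([[101, 7592, 102, 0, 0]])  # e.g. [CLS] hello [SEP] [PAD] [PAD]
out = word_embeddings(input_ids)                    # shape: (1, 5, 768)
print(out.shape)
print(out[0, 3].abs().sum())                        # tensor(0.) -> [PAD] positions are zero vectors

padding_idx also keeps the gradient of that embedding row at zero, so the [PAD] vector stays zero during training.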

 

2.2 position_embeddings

  • Maps the high-dimensional (max_position_embeddings) position vector to a low-dimensional (hidden_size: 768) vector
  • max_position_embeddings? → the maximum sequence length (default: 512)

 

 

self.position_embeddings = nn.Embedding(config.max_position_embeddings, 
                                        config.hidden_size)

 

 

<Notes from the source code>

  • `max_position_embeddings` (int, optional, defaults to 512): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
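
A small sketch of how the absolute position embedding is looked up by position index (sizes are the defaults above):

import torch
from torch import nn

position_embeddings = nn.Embedding(512, 768)           # max_position_embeddings x hidden_size

seq_length = 16
position_ids = torch.arange(seq_length).unsqueeze(0)   # shape: (1, 16) -> indices 0..15
print(position_embeddings(position_ids).shape)         # torch.Size([1, 16, 768])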

 

2.3 token_type_embeddings

  • Embeds the low-dimensional (type_vocab_size: 2) token_type vector into the high-dimensional (hidden_size) space

self.token_type_embeddings = nn.Embedding(config.type_vocab_size, 
                                        config.hidden_size)

 

<Notes from the source code>

  • `type_vocab_size` (int, optional, defaults to 2): The vocabulary size of the token_type_ids passed when calling [BertModel] or [TFBertModel].
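
A sketch of the segment embedding lookup: token_type_ids are 0 for the first sentence and 1 for the second (the ids below are made up for illustration):

import torch
from torch import nn

token_type_embeddings = nn.Embedding(2, 768)             # type_vocab_size x hidden_size

token_type_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1]])   # sentence A -> 0, sentence B -> 1
print(token_type_embeddings(token_type_ids).shape)       # torch.Size([1, 7, 768])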

 

2.4 LayerNorm

  • LayerNorm: normalizes per data sample within a batch (here, each token vector is normalized over the hidden dimension)

self.LayerNorm = nn.LayerNorm(config.hidden_size, 
                            eps=config.layer_norm_eps)

 

<Notes from the source code>

  • `layer_norm_eps` (float, optional, defaults to 1e-12): The epsilon used by the layer normalization layers.
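
A quick sketch showing that nn.LayerNorm normalizes each token vector over the hidden dimension, independently of the other samples in the batch:

import torch
from torch import nn

layer_norm = nn.LayerNorm(768, eps=1e-12)

x = torch.randn(2, 5, 768)     # (batch, seq_len, hidden)
y = layer_norm(x)
print(y.mean(dim=-1))          # ~0 for every token
print(y.std(dim=-1))           # ~1 for every token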

 

 

2.5 dropout

  • Dropout applied to the final embedding output (after the LayerNorm)

self.dropout = nn.Dropout(config.hidden_dropout_prob)

<Notes from the source code>

  • `hidden_dropout_prob` (float, optional, defaults to 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
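
A sketch of the dropout behaviour: roughly hidden_dropout_prob of the activations are zeroed in training mode, and dropout is a no-op in eval mode:

import torch
from torch import nn

dropout = nn.Dropout(0.1)

x = torch.ones(2, 4, 768)
print((dropout(x) == 0).float().mean())   # ~0.1 (module is in train mode by default)

dropout.eval()
print((dropout(x) == 0).float().mean())   # 0.0 (identity in eval mode)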

 

 

2.6 Other variables

1) "position_ids"

: Creates a tensor of sequential position indices, one for each position up to the sequence max_length

: shape = torch.Size([1, max_sequence_length])

(max_sequence_length = max_position_embeddings)

self.register_buffer("position_ids", 
                     torch.arange(config.max_position_embeddings).expand((1, -1)))
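
A sketch of what this buffer holds, and how forward() later slices it down to the current sequence length:

import torch

position_ids = torch.arange(512).expand((1, -1))   # shape: (1, 512), values 0..511

seq_length = 8
print(position_ids[:, :seq_length])                # tensor([[0, 1, 2, 3, 4, 5, 6, 7]])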

 

2) "position_embedding_type"

self.position_embedding_type = getattr(config, 
                                    "position_embedding_type", 
                                    "absolute")

<Notes from the source code>

  • `position_embedding_type` (str, optional, defaults to "absolute"): Type of position embedding. Choose one of "absolute", "relative_key", "relative_key_query".

<BertEmbeddings __init__ function>

from typing import Optional

import torch
from torch import nn
from packaging import version


class BertEmbeddings(nn.Module):
    """Construct the embeddings from word, position and token_type embeddings."""

    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, 
                                            config.hidden_size, 
                                            padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, 
                                                config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, 
                                                  config.hidden_size)

        self.LayerNorm = nn.LayerNorm(config.hidden_size, 
                                    eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

        self.position_embedding_type = getattr(config, 
                                                "position_embedding_type", 
                                                "absolute")
        self.register_buffer("position_ids", 
                            torch.arange(config.max_position_embeddings).expand((1, -1)))

        if version.parse(torch.__version__) > version.parse("1.6.0"):
            self.register_buffer(
                "token_type_ids",
                torch.zeros(self.position_ids.size(), dtype=torch.long),
                persistent=False,
            )
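
To see the resulting module tree, the class can be instantiated directly from a default config. A sketch, assuming a recent version of the transformers library (the import path below may differ between versions):

from transformers import BertConfig
from transformers.models.bert.modeling_bert import BertEmbeddings

config = BertConfig()
embeddings = BertEmbeddings(config)
print(embeddings)
# BertEmbeddings(
#   (word_embeddings): Embedding(30522, 768, padding_idx=0)
#   (position_embeddings): Embedding(512, 768)
#   (token_type_embeddings): Embedding(2, 768)
#   (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
#   (dropout): Dropout(p=0.1, inplace=False)
# )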

 

 

3. The BertEmbedding forward pass

# Look up each of the three embeddings
inputs_embeds = self.word_embeddings(input_ids)
token_type_embeddings = self.token_type_embeddings(token_type_ids)
position_embeddings = self.position_embeddings(position_ids)

# Sum them, then apply LayerNorm and dropout
embeddings = inputs_embeds + token_type_embeddings + position_embeddings
embeddings = self.LayerNorm(embeddings)
embeddings = self.dropout(embeddings)

<original full code>

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        token_type_ids: Optional[torch.LongTensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        past_key_values_length: int = 0,
    ) -> torch.Tensor:

        if input_ids is not None:
            input_shape = input_ids.size()
        else:
            input_shape = inputs_embeds.size()[:-1]

        seq_length = input_shape[1]

        if position_ids is None:
            position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length]

        # Setting the token_type_ids to the registered buffer in constructor where it is all zeros, which usually occurs
        # when its auto-generated, registered buffer helps users when tracing the model without passing token_type_ids, solves
        # issue #5664
        if token_type_ids is None:
            if hasattr(self, "token_type_ids"):
                buffered_token_type_ids = self.token_type_ids[:, :seq_length]
                buffered_token_type_ids_expanded = buffered_token_type_ids.expand(input_shape[0], seq_length)
                token_type_ids = buffered_token_type_ids_expanded
            else:
                token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)

        if inputs_embeds is None:
            inputs_embeds = self.word_embeddings(input_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        embeddings = inputs_embeds + token_type_embeddings
        if self.position_embedding_type == "absolute":
            position_embeddings = self.position_embeddings(position_ids)
            embeddings += position_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings
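
An end-to-end sketch that runs a real sentence through the embedding layer of a pre-trained model (assumes transformers is installed and can download the bert-base-uncased checkpoint):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello, BERT!", return_tensors="pt")
with torch.no_grad():
    emb = model.embeddings(input_ids=inputs["input_ids"],
                           token_type_ids=inputs["token_type_ids"])
print(emb.shape)   # torch.Size([1, seq_len, 768])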

<small questions>

- Why is register_buffer used here?

- How can we inspect the word embedding table of a pre-trained BERT? (see the sketch after this list)

- How is nn.Embedding different from nn.Linear?
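
For the second question, a sketch of one way to look at the pre-trained word embedding table; it is simply the weight of word_embeddings (assumes transformers and the bert-base-uncased checkpoint):

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

table = model.embeddings.word_embeddings.weight   # Parameter of shape (30522, 768)
print(table.shape)
print(model.get_input_embeddings())               # same Embedding module, via the generic accessor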

 
