[Pytorch][BERT] Understanding the BERT Source Code — Table of Contents
- BERT
  - 📑 BERT Config
  - 📑 BERT Tokenizer
  - 📑 BERT Model
    - 📑 BERT Input
    - 📑 BERT Output
    - 📑 BERT Embedding 👀
    - 📑 BERT Pooler
    - 📑 BERT Encoder
      - 📑 BERT Layer
        - 📑 BERT SelfAttention
        - 📑 BERT SelfOutput
1. What is BertEmbeddings?
: Builds the embeddings that are fed into BertEncoder
= WordPiece Embedding + Position Embedding + Segment Embedding
| Embedding | Description | Default |
|---|---|---|
| WordPiece Embedding (= word_embeddings) | The word embeddings that form the actual input | Vocabulary size: 30,522 |
| Position Embedding (= position_embeddings) | Embeddings that learn positional information | Maximum sentence length: 512 |
| Segment Embedding (= token_type_embeddings) | Embeddings that distinguish the two sentences | Maximum number of sentences: 2 |
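To make the table concrete, here is a minimal sketch (mine, not from the BERT source) that builds the three tables with the default BertConfig values and checks their sizes; it assumes the transformers library is installed.

```python
from torch import nn
from transformers import BertConfig

config = BertConfig()  # defaults: vocab_size=30522, hidden_size=768,
                       # max_position_embeddings=512, type_vocab_size=2

word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size,
                               padding_idx=config.pad_token_id)
position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

print(word_embeddings.weight.shape)        # torch.Size([30522, 768])
print(position_embeddings.weight.shape)    # torch.Size([512, 768])
print(token_type_embeddings.weight.shape)  # torch.Size([2, 768])
```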
2. Components of BertEmbeddings
2.1 word_embeddings
- Maps the high-dimensional (vocab_size: 30,522) one-hot token representation to a low-dimensional (hidden_size: 768) dense vector
- Only word_embeddings receives the padding_idx argument, so the padding token is mapped to a (non-trainable) zero vector, effectively masking it
self.word_embeddings = nn.Embedding(config.vocab_size,
config.hidden_size,
padding_idx=config.pad_token_id)
<Explanation from the code>
- 'vocab_size' (int, optional, defaults to 30522): Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the 'inputs_ids' passed when calling [BertModel] or [TFBertModel].
- 'hidden_size' (int, optional, defaults to 768): Dimensionality of the encoder layers and the pooler layer.
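A quick sketch (my own check, not from the post) of what padding_idx does: the row for the [PAD] id is initialized to zeros and receives no gradient updates, so padded positions contribute a zero vector.

```python
import torch
from torch import nn

pad_token_id = 0  # BertConfig's default pad_token_id
emb = nn.Embedding(30522, 768, padding_idx=pad_token_id)

# The [PAD] row is all zeros, and its gradient is always zero during training.
print(emb(torch.tensor([pad_token_id])).abs().sum())  # tensor(0., grad_fn=<SumBackward0>)
print(emb.weight[pad_token_id].abs().sum())           # tensor(0., grad_fn=<SumBackward0>)
```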
2.2 position_embeddings
- Maps the high-dimensional (max_position_embeddings) position index to a low-dimensional (hidden_size: 768) dense vector
- max_position_embeddings? → the maximum sequence length (default: 512)
self.position_embeddings = nn.Embedding(config.max_position_embeddings,
config.hidden_size)
<Explanation from the code>
- ‘max_position_embeddings’ (int, optional, defaults to 512)
: The maximum sequence length that this model might ever be used with.
: Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
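A small sketch of the lookup (assuming the default sizes): position embeddings depend only on the token's position index, not on the token itself.

```python
import torch
from torch import nn

position_embeddings = nn.Embedding(512, 768)  # max_position_embeddings x hidden_size

seq_len = 10
position_ids = torch.arange(seq_len).unsqueeze(0)  # shape (1, 10): indices 0..9
print(position_embeddings(position_ids).shape)     # torch.Size([1, 10, 768])
```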
2.3 token_type_embeddings
- Embeds the low-dimensional (type_vocab_size: 2) token_type ids into high-dimensional (hidden_size: 768) vectors
self.token_type_embeddings = nn.Embedding(config.type_vocab_size,
config.hidden_size)
<Explanation from the source code>
- 'type_vocab_size' (int, optional, defaults to 2): The vocabulary size of the token_type_ids passed when calling [BertModel] or [TFBertModel].
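As a sketch (the token layout below is made up for illustration), a sentence-pair input gives segment A token type 0 and segment B token type 1, and each id is looked up in this 2 x 768 table:

```python
import torch
from torch import nn

token_type_embeddings = nn.Embedding(2, 768)  # type_vocab_size x hidden_size

# [CLS] tokens-of-A [SEP] tokens-of-B [SEP] -> segment A = 0, segment B = 1
token_type_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1]])
print(token_type_embeddings(token_type_ids).shape)  # torch.Size([1, 7, 768])
```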
2.4 LayerNorm
- LayerNorm: normalizes each sample independently (here, each token's hidden_size-dim vector), rather than across the batch
self.LayerNorm = nn.LayerNorm(config.hidden_size,
eps=config.layer_norm_eps)
<Explanation from the code>
- 'layer_norm_eps' (float, optional, defaults to 1e-12): The epsilon used by the layer normalization layers.
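A minimal check (my own, assuming the default hidden_size of 768): LayerNorm normalizes over the last dimension, so each token's 768-dim vector is standardized independently of the other tokens and of the batch.

```python
import torch
from torch import nn

layer_norm = nn.LayerNorm(768, eps=1e-12)

x = torch.randn(2, 5, 768)                # (batch, seq_len, hidden_size)
y = layer_norm(x)
print(y.mean(dim=-1).abs().max().item())  # ~0.0: every token vector has zero mean
print(y.std(dim=-1).mean().item())        # ~1.0: and (roughly) unit std
```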
2.5 dropout
- Dropout applied to the summed embeddings, with probability hidden_dropout_prob
self.dropout = nn.Dropout(config.hidden_dropout_prob)
<Explanation from the code>
- ‘hidden_dropout_prob‘(float, optional, defaults to 0.1): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
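A small reminder sketch (not from the post): this dropout is only active in training mode, so model.eval() disables it at inference time.

```python
import torch
from torch import nn

dropout = nn.Dropout(p=0.1)  # default hidden_dropout_prob
x = torch.ones(1, 5, 768)

dropout.train()
print((dropout(x) == 0).float().mean().item())  # roughly 0.1 of the entries are zeroed

dropout.eval()
print((dropout(x) == 0).float().mean().item())  # 0.0 -> identity at inference time
```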
2.6 Other attributes
1) "position_ids"
: a tensor holding the sequential position indices up to the maximum sequence length
: shape = torch.Size([1, max_sequence_length])
(max_sequence_length = max_position_embeddings)
self.register_buffer("position_ids",
torch.arange(config.max_position_embeddings).expand((1, -1)))
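This also touches on the first of the small questions at the end of the post. A sketch (my own toy module, not BERT code) of what register_buffer gives you: the tensor is saved in state_dict and moves with .to(device), but it is not a trainable parameter.

```python
import torch
from torch import nn

class Dummy(nn.Module):
    def __init__(self, max_position_embeddings=512):
        super().__init__()
        # Stored as module state, but not registered as an nn.Parameter.
        self.register_buffer(
            "position_ids",
            torch.arange(max_position_embeddings).expand((1, -1)),
        )

m = Dummy()
print(m.position_ids.shape)         # torch.Size([1, 512])
print(list(m.state_dict().keys()))  # ['position_ids'] -> saved/loaded with the model
print(len(list(m.parameters())))    # 0 -> receives no gradient updates
```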
2) "position_embedding_type"
self.position_embedding_type = getattr(config,
"position_embedding_type",
"absolute")
<Explanation from the code>
- 'position_embedding_type' (str, optional, defaults to "absolute"): Type of position embedding. Choose one of "absolute", "relative_key", "relative_key_query". For positional embeddings use "absolute". For more information on "relative_key", please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on "relative_key_query", please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).
<BertEmbeddings __init__ function>
class BertEmbeddings(nn.Module):
"""Construct the embeddings from word, position and token_type embeddings."""
def __init__(self, config):
super().__init__()
self.word_embeddings = nn.Embedding(config.vocab_size,
config.hidden_size,
padding_idx=config.pad_token_id)
self.position_embeddings = nn.Embedding(config.max_position_embeddings,
config.hidden_size)
self.token_type_embeddings = nn.Embedding(config.type_vocab_size,
config.hidden_size)
self.LayerNorm = nn.LayerNorm(config.hidden_size,
eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
self.position_embedding_type = getattr(config,
"position_embedding_type",
"absolute")
self.register_buffer("position_ids",
torch.arange(config.max_position_embeddings).expand((1, -1)))
if version.parse(torch.__version__) > version.parse("1.6.0"):
self.register_buffer(
"token_type_ids",
torch.zeros(self.position_ids.size(), dtype=torch.long),
persistent=False,
)
3. The BertEmbeddings forward pass
inputs_embeds = self.word_embeddings(input_ids)
token_type_embeddings = self.token_type_embeddings(token_type_ids)
position_embeddings = self.position_embeddings(position_ids)
embeddings = inputs_embeds + token_type_embeddings + position_embeddings
embeddings = self.LayerNorm(embeddings)
embeddings = self.dropout(embeddings)
<original full code>
def forward(
self,
input_ids: Optional[torch.LongTensor] = None,
token_type_ids: Optional[torch.LongTensor] = None,
position_ids: Optional[torch.LongTensor] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
past_key_values_length: int = 0,
) -> torch.Tensor:
if input_ids is not None:
input_shape = input_ids.size()
else:
input_shape = inputs_embeds.size()[:-1]
seq_length = input_shape[1]
if position_ids is None:
position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length]
# Setting the token_type_ids to the registered buffer in constructor where it is all zeros, which usually occurs
# when its auto-generated, registered buffer helps users when tracing the model without passing token_type_ids, solves
# issue #5664
if token_type_ids is None:
if hasattr(self, "token_type_ids"):
buffered_token_type_ids = self.token_type_ids[:, :seq_length]
buffered_token_type_ids_expanded = buffered_token_type_ids.expand(input_shape[0], seq_length)
token_type_ids = buffered_token_type_ids_expanded
else:
token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)
if inputs_embeds is None:
inputs_embeds = self.word_embeddings(input_ids)
token_type_embeddings = self.token_type_embeddings(token_type_ids)
embeddings = inputs_embeds + token_type_embeddings
if self.position_embedding_type == "absolute":
position_embeddings = self.position_embeddings(position_ids)
embeddings += position_embeddings
embeddings = self.LayerNorm(embeddings)
embeddings = self.dropout(embeddings)
return embeddings
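A usage sketch (assuming the transformers library and the bert-base-uncased checkpoint) that runs only the embedding layer of a pre-trained model on a tokenized sentence pair:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello world", "How are you?", return_tensors="pt")
with torch.no_grad():
    embeddings = model.embeddings(
        input_ids=inputs["input_ids"],
        token_type_ids=inputs["token_type_ids"],
    )
print(embeddings.shape)  # torch.Size([1, seq_len, 768])
```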
<Small questions>
- Why is register_buffer used here?
- How can I inspect the word embedding table of a pre-trained BERT?
- What is the difference between nn.Embedding and nn.Linear?
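Not part of the post, but a rough sketch toward the last two questions, under the assumption that bert-base-uncased is used (the register_buffer question is touched on in the sketch in 2.6 above):

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# The pre-trained word embedding table is simply the weight of word_embeddings.
table = model.embeddings.word_embeddings.weight
print(table.shape)  # torch.Size([30522, 768])

# nn.Embedding is an integer-id lookup table; nn.Linear multiplies a dense
# input by a weight matrix. Looking up id i is equivalent to multiplying a
# one-hot vector for i with the same table.
cls_id = 101  # [CLS] in bert-base-uncased
one_hot = torch.zeros(1, table.size(0))
one_hot[0, cls_id] = 1.0
lookup = model.embeddings.word_embeddings(torch.tensor([cls_id]))
print(torch.allclose(one_hot @ table, lookup))  # True
```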