[Pytorch][BERT] Understanding the BERT Source Code_③ BertTokenizer

[Pytorch][BERT] Understanding the BERT Source Code: Contents
BERT	📑 BERT Config
	📑 BERT Tokenizer 👀
	📑 BERT Model	📑 BERT Input
		📑 BERT Output
		📑 BERT Embedding
		📑 BERT Pooler
		📑 BERT Encoder	📑 BERT Layer	📑 BERT SelfAttention
				📑 BERT SelfOutput

BertTokenizer

1. Understanding BertTokenizer

Tokenizer definition: a tool that splits a given corpus into units called tokens
What is special about BertTokenizer?
- It applies the WordPiece tokenizer (a variant of the BPE algorithm)
- BPE (Byte Pair Encoding): a representative subword segmentation algorithm for mitigating the OOV (Out-Of-Vocabulary) problem
- Subword segmentation: a preprocessing step that splits one word into several subwords before encoding and embedding it, because a single word is often a combination of smaller, meaningful subwords (e.g., birthplace = birth + place)
- (Reference) https://wikidocs.net/22592
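
To make the subword idea concrete, here is a minimal sketch of the greedy longest-match-first lookup that WordPiece-style tokenizers use when splitting a word at tokenization time. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical, made up for illustration; the real BERT vocabulary is learned from a corpus and is far larger.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry a "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word is treated as unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real BERT vocabulary.
toy_vocab = {"birth", "##place", "##day", "place", "water"}

print(wordpiece_tokenize("birthplace", toy_vocab))  # ['birth', '##place']
print(wordpiece_tokenize("water", toy_vocab))       # ['water']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

A word that is not in the vocabulary as a whole (birthplace) can still be encoded from known pieces, which is how subword segmentation eases the OOV problem.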

2. BertTokenizer input and output

✔ input

: text (List[str])

(Example) sequence = ["apple people water", "people apple water", "water apple people"]

✔ output

: tokenization result (Dict)

(Example)

{'input_ids': tensor([[ 101, 6207, 2111, 2300,  102],
                      [ 101, 2111, 6207, 2300,  102],
                      [ 101, 2300, 6207, 2111,  102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1],
                           [1, 1, 1, 1, 1]])}
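
The shape of this output can be reproduced with a small pure-Python sketch. It is not how BertTokenizer is implemented: `encode_batch` is a hypothetical helper, whitespace splitting stands in for the WordPiece step, and `toy_vocab` contains only the five ids that appear in the example above.

```python
# Ids copied from the example output above. Whitespace splitting stands in
# for the real WordPiece step, so this only works for words in the toy vocab.
toy_vocab = {"[CLS]": 101, "[SEP]": 102, "apple": 6207, "people": 2111, "water": 2300}


def encode_batch(texts, vocab):
    """Build the three output fields for a batch of equal-length single sentences."""
    input_ids = []
    for text in texts:
        # Every sequence is wrapped in [CLS] ... [SEP].
        tokens = ["[CLS]"] + text.split() + ["[SEP]"]
        input_ids.append([vocab[token] for token in tokens])
    return {
        "input_ids": input_ids,
        # Single-sentence inputs: every token belongs to segment 0.
        "token_type_ids": [[0] * len(ids) for ids in input_ids],
        # No padding is needed here, so every position is a real token.
        "attention_mask": [[1] * len(ids) for ids in input_ids],
    }


sequence = ["apple people water", "people apple water", "water apple people"]
encoded = encode_batch(sequence, toy_vocab)
print(encoded["input_ids"][0])       # [101, 6207, 2111, 2300, 102]
print(encoded["token_type_ids"][0])  # [0, 0, 0, 0, 0]
print(encoded["attention_mask"][0])  # [1, 1, 1, 1, 1]
```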

2.2 Anatomy of the BertTokenizer output

✔ input_ids

: integer encoding of each token

: (torch.LongTensor of shape (batch_size, sequence_length))

: Indices of input sequence tokens in the vocabulary.

: Indices can be obtained using [BertTokenizer].

: See [PreTrainedTokenizer.encode] and [PreTrainedTokenizer.__call__] for details.

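from transformers import BertModel, BertTokenizer

# Assumption: the ids shown below match the "bert-base-uncased" vocabulary
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

sequence = ["apple people water"]
inputs = tokenizer(sequence, return_tensors="pt")

outputs = model(**inputs)
input_ids = inputs["input_ids"]
# tensor([[ 101, 6207, 2111, 2300,  102]])

# decode() takes one sequence of ids, so pass the first row of the batch
tokenizer.decode(input_ids[0])
# [CLS] apple people water [SEP]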

✔ token_type_ids (= segment_ids)

: exists for the 'NSP (Next Sentence Prediction)' task in the pre-training stage

: when fine-tuning on single-sentence inputs, all values are 0 (https://ratsgo.github.io/nlpbook/docs/language_model/tutorial/)

(torch.LongTensor of shape (batch_size, sequence_length), optional)

: Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]:

0 corresponds to a sentence A token,
1 corresponds to a sentence B token.

sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"

encoded_dict = tokenizer(sequence_a, sequence_b)
encoded_dict["token_type_ids"]
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

decoded = tokenizer.decode(encoded_dict["input_ids"])
# [CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
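
The 0/1 pattern above follows directly from the layout `[CLS] A [SEP] B [SEP]`. A minimal sketch, assuming the two sentences are already split into tokens; `build_pair_inputs` and the token lists are hypothetical and only illustrate the rule, not the library's implementation.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


# Hypothetical, already-tokenized sentences.
tokens, token_type_ids = build_pair_inputs(["it", "is", "here"], ["where", "?"])
print(tokens)          # ['[CLS]', 'it', 'is', 'here', '[SEP]', 'where', '?', '[SEP]']
print(token_type_ids)  # [0, 0, 0, 0, 0, 1, 1, 1]
```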

✔ attention_mask

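: [Purpose] block attention to padding!

→ In BERT, tokens added by zero padding are masked so that they receive no attention score, at inference as well as during training

: '1' - real (word) tokens / '0' - padded tokens

(torch.FloatTensor of shape (batch_size, sequence_length), optional)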

: Mask to avoid performing attention on padding token indices.

Mask values selected in [0, 1]:

1 for tokens that are not masked,
0 for tokens that are masked.
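
As a rough illustration of why the mask matters, the sketch below pads a batch to a common length, builds the 1/0 mask, and shows that masked positions end up with zero attention weight after the softmax. `pad_batch` and `masked_softmax` are hypothetical helpers written in plain Python, `PAD_ID = 0` is an assumption about the vocabulary, and replacing masked scores with a large negative number is a simplification of what the model does internally.

```python
import math

PAD_ID = 0  # assumption: the [PAD] token has id 0


def pad_batch(batch_ids, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 = real token (attend), 0 = padding (do not attend)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


def masked_softmax(scores, mask):
    """Softmax over attention scores, ignoring positions where mask is 0."""
    # Masked positions get a very large negative score, so exp() gives 0.
    masked = [s if m == 1 else -1e9 for s, m in zip(scores, mask)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


batch = [[101, 6207, 2111, 2300, 102], [101, 2300, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids[1])       # [101, 2300, 102, 0, 0]
print(attention_mask[1])  # [1, 1, 1, 0, 0]

# With equal raw scores, the weight is shared only among the real tokens.
weights = masked_softmax([1.0, 1.0, 1.0, 1.0, 1.0], attention_mask[1])
print(weights[3], weights[4])  # 0.0 0.0
```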


Hyen4110
