Transformers, NLP, pipeline: sentiment classification, named entity recognition, question answering, summarization, generation

Olivia-BlackCherry 2024. 8. 22. 17:50

    Setup

    # pandas
    import pandas as pd
    # Suppress warning messages
    import warnings
    warnings.filterwarnings("ignore")  # to restore: warnings.filterwarnings("default")

     

    Example text
    text = "It was a beautiful, sunny day, and Anna was excited to visit the art museum. The city was bustling, and she felt alive as she walked through the streets. Her friend Sarah canceled last minute, which disappointed her, but she decided to go alone. At the museum, the paintings stirred deep emotions—joy, sadness, and reflection. After a peaceful coffee break, she witnessed a kind moment when a child spilled dishes. As the sun set, Anna felt content, reminded that emotions come and go, each one meaningful."

     

     

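    1. The pipeline function

    - Makes it easy to use a pretrained model for a specific NLP task.
    - Supports a variety of NLP tasks, including:
        - text-classification
        - question-answering
        - translation
        - summarization
    - Reduces model loading and prediction, which are otherwise fairly involved, to a few lines of code.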

     

     

     

    1-1 Sentiment classification

    from transformers import pipeline
    classifier = pipeline("text-classification")
    output = classifier(text)
    pd.DataFrame(output)
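The classifier returns a label and a score for the text. As a rough sketch of the post-processing step that pipeline hides, the snippet below turns raw class scores (logits) into that label/score pair with a softmax. The logits and the `id2label` mapping are hypothetical values chosen for illustration, not real model output, and `postprocess` is a made-up helper name rather than a transformers function.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits, id2label):
    """Pick the most probable class and report it as label + score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical values for illustration only
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [-1.2, 2.3]

result = postprocess(logits, id2label)
print(result)
```

The dictionary has the same two keys as each row of the DataFrame above, which is why the pipeline output can be passed straight to pd.DataFrame.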

     

     

     

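    1-2 Named entity recognition

    • NER (Named Entity Recognition)
    • The task of identifying specific entities in text, such as persons (PER), locations (LOC), and organizations (ORG)
    • aggregation_strategy="simple": the NER pipeline merges consecutive tokens that belong to the same entity into a single entry, e.g. New York City => one LOC entity
    • aggregation_strategy="none": each token's result is returned individually
    • Other options are first, max, and average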
    ner_tagger = pipeline("ner", aggregation_strategy="simple")
    outputs = ner_tagger(text)
    pd.DataFrame(outputs)
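To make the difference between "none" and "simple" concrete, here is a simplified sketch of the grouping idea: consecutive tokens with the same entity type are merged, a B- tag starts a new group, and the group score is the mean of the token scores. The token list and scores are hypothetical, `aggregate_simple` is an illustrative helper rather than the library's own code, and joining words with spaces stands in for the tokenizer-based string reconstruction the real pipeline performs.

```python
from statistics import mean

def split_tag(label):
    """Split 'B-LOC' into ('B', 'LOC'); untagged labels such as 'O' count as 'I'."""
    if label.startswith("B-"):
        return "B", label[2:]
    if label.startswith("I-"):
        return "I", label[2:]
    return "I", label

def aggregate_simple(tokens):
    """Merge adjacent tokens of the same entity type into one entry."""
    groups = []
    for tok in tokens:
        bi, tag = split_tag(tok["entity"])
        same_as_previous = bool(groups) and split_tag(groups[-1][-1]["entity"])[1] == tag
        if same_as_previous and bi != "B":
            groups[-1].append(tok)   # continue the current entity
        else:
            groups.append([tok])     # start a new entity
    results = []
    for group in groups:
        tag = split_tag(group[0]["entity"])[1]
        if tag == "O":               # drop non-entity tokens
            continue
        results.append({
            "entity_group": tag,
            "score": mean(t["score"] for t in group),
            "word": " ".join(t["word"] for t in group),
        })
    return results

# Hypothetical per-token output, similar in shape to aggregation_strategy="none"
tokens = [
    {"word": "Anna", "entity": "B-PER", "score": 0.95},
    {"word": "visited", "entity": "O", "score": 0.99},
    {"word": "New", "entity": "B-LOC", "score": 0.9},
    {"word": "York", "entity": "I-LOC", "score": 0.8},
    {"word": "City", "entity": "I-LOC", "score": 0.7},
]

entities = aggregate_simple(tokens)
for entity in entities:
    print(entity["entity_group"], entity["word"])
```

The first, max, and average strategies differ mainly in how they resolve disagreeing labels among the sub-tokens of a single word, which this sketch does not model.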

     

     

    1-3 Question answering

    reader = pipeline("question-answering")
    question = "What emotions did Anna experience while looking at the paintings in the museum?"
    outputs = reader(question=question, context=text)
    pd.DataFrame([outputs])
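The reader returns an answer span together with score, start, and end. As a hedged sketch of how an extractive reader can pick that span, the snippet below scores every (start, end) pair as start probability times end probability, skips pairs where the end precedes the start or the span is too long, and keeps the best one. The tokens and probabilities are invented for illustration, `best_span` is a hypothetical helper, and the real pipeline reports character offsets rather than token indices.

```python
def best_span(start_probs, end_probs, max_answer_len=15):
    """Return (score, start_index, end_index) of the highest-scoring valid span."""
    best = (0.0, 0, 0)
    for s, p_start in enumerate(start_probs):
        # The end may not come before the start, and the span length is capped
        last = min(s + max_answer_len, len(end_probs))
        for e in range(s, last):
            score = p_start * end_probs[e]
            if score > best[0]:
                best = (score, s, e)
    return best

# Hypothetical tokens and probabilities for illustration only
tokens = ["the", "paintings", "stirred", "joy", ",", "sadness", "and", "reflection"]
start_probs = [0.01, 0.02, 0.02, 0.80, 0.01, 0.10, 0.01, 0.03]
end_probs = [0.01, 0.01, 0.01, 0.10, 0.01, 0.15, 0.01, 0.70]

score, start, end = best_span(start_probs, end_probs)
answer = " ".join(tokens[start:end + 1])
print(answer, round(score, 2))
```

Lowering `max_answer_len` in this sketch forces a shorter answer, which mirrors why a reader sometimes returns only part of a longer list of items.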

     

     

    1-4 Translation

    pip install sentencepiece
    pip install protobuf

     

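    # Every attempt failed

    Every attempt at English-to-Korean translation with a transformer model failed.

    The reason is unclear (?). One likely factor in the first attempt: google/mt5-small is a pretrained-only checkpoint, and its model card states that it has to be fine-tuned before it can be used on a downstream task such as translation.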

    from transformers import pipeline
    
    # Attempt 1: English-to-Korean translation with the mT5 model
    translator = pipeline("translation", model="google/mt5-small", tokenizer="google/mt5-small", max_length=400)
    
    text = "It was a beautiful, sunny day, and Anna was excited to visit the art museum."
    translated_text = translator(text)
    print(translated_text)
    
    # Attempt 2: English-to-Korean translation with a Helsinki-NLP model
    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-en-ko")
    
    outputs = translator(text, clean_up_tokenization_spaces=True, max_length=1000)
    print(outputs[0]['translation_text'])
    
    # Attempt 3: the same Helsinki-NLP model with an explicit task name
    translator = pipeline("translation_en_to_ko", model="Helsinki-NLP/opus-mt-tc-big-en-ko")
    translated_text = translator(text)
    print(translated_text)

     

    # Translation into other languages works fine (English to German here)

    translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
    outputs = translator(text, clean_up_tokenization_spaces=True, max_length=1000)
    print(outputs[0]['translation_text'])

     

     

     

    1-5 Text generation

    generator = pipeline("text-generation")
    response = "Mom, I am so sorry."
    prompt = text + " " + response  # add a space so the two sentences do not run together
    outputs = generator(prompt, max_length= 200)
    print(outputs[0]['generated_text'])
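One detail worth knowing about the call above is that max_length counts the prompt tokens as well as the newly generated ones, so a long prompt leaves less room for new text. The toy loop below illustrates that budget with a hypothetical next-token lookup table standing in for a language model. `NEXT` and `generate` are made-up names for this sketch, not part of transformers.

```python
# Hypothetical next-token table standing in for a real language model
NEXT = {"I": "am", "am": "so", "so": "sorry", "sorry": ".", ".": "I"}

def generate(prompt_tokens, max_length):
    """Greedy toy generation: append tokens until the total length hits max_length."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        tokens.append(NEXT.get(tokens[-1], "."))
    return tokens

prompt_tokens = ["Mom", ",", "I", "am", "so", "sorry", "."]

generated = generate(prompt_tokens, max_length=10)
new_tokens = generated[len(prompt_tokens):]
print(len(prompt_tokens), "prompt tokens,", len(new_tokens), "new tokens")
```

In transformers, the max_new_tokens argument limits only the generated part, which is often easier to reason about when the prompt length varies.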

     

     

    1-6 Summarization

    text = "It was a beautiful, sunny day, and Anna was excited to visit the art museum. The city was bustling, and she felt alive as she walked through the streets. Her friend Sarah canceled last minute, which disappointed her, but she decided to go alone. At the museum, the paintings stirred deep emotions—joy, sadness, and reflection. After a peaceful coffee break, she witnessed a kind moment when a child spilled dishes. As the sun set, Anna felt content, reminded that emotions come and go, each one meaningful."
    summarizer = pipeline("summarization")
    outputs = summarizer(text, max_length=100, clean_up_tokenization_spaces=True)
    print(outputs[0]['summary_text'])

     

     
