Fine-tuning OpenAI's GPT with Japanese Prime Minister's speech in the Diet

2023-09-11 python machinelearning llm

OpenAI provides APIs to create conversation and convert text to vector, and also fine-tune models with your own dataset, which enables us to improve the quality of the output or save the cost of few-shot learning.

(PS: 2025-03-08) In v1.0.0, interface was changed to be called from client = OpenAI(). The code below is for the previous version.

# !pip install openai
import openai
import numpy as np

response = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[{
    "role": "user", # "system", "user", or "assistant"
    "content": "Is this a pen?"
  }],
  temperature=0.5,
)

print(response.choices[0].message)
'''
{
  "role": "assistant",
  "content": "As an AI, I don't have the ability to see or perceive objects. Therefore, I cannot determine if something is a pen or not based on visual cues."
}
'''

embedding = openai.Embedding.create(
  model="text-embedding-ada-002", # https://platform.openai.com/docs/guides/embeddings/embedding-models
  input=response.choices[0].message.content
)

print(embedding.data[0].embedding[:5]) # [-0.02444140985608101, -0.015481687150895596, 0.009029921144247055, 0.0035672972444444895, -0.006072063464671373]
print(len(embedding.data[0].embedding)) # 1536

# https://platform.openai.com/docs/guides/embeddings/limitations-risks
# https://github.com/openai/openai-python/blob/5d50e9e3b39540af782ca24e65c290343d86e1a9/openai/embeddings_utils.py#L65C1-L65C1
def cosine_similarity(a, b):
  return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cosine_similarity(embedding.data[0].embedding, embedding.data[0].embedding) # 1.0

Ready for the dataset

Collect questions and answers to the Prime Minister or his deputy with API for the Japanese Diet Minutes Search System, output them in the format for fine-tuning. There are very long questions and answers in the speeches, but they are excluded mainly for the sake of cost of learning. I noticed later that I could get them in JSON.

# !pip install beautifulsoup4
import requests
from bs4 import BeautifulSoup
import re
import json
from datetime import date, timedelta

p = re.compile('〔[^〔〕]*〕')
with open("fine_tune.jsonl", "w") as file:
  day_until = date(2021, 12, 31)
  for i in range(6):
    day_from = day_until + timedelta(days=1)
    day_until = day_from + timedelta(days=30)
    print(f"{day_from.strftime('%Y-%m-%d')} - {day_until.strftime('%Y-%m-%d')}")
    params = {
      'any': 1,
      'maximumRecords': 10,
      'from': day_from.strftime('%Y-%m-%d'),
      'until': day_until.strftime('%Y-%m-%d')
    }
    response = requests.get('https://kokkai.ndl.go.jp/api/meeting', params=params)
    data = response.text

    soup = BeautifulSoup(data, 'html.parser')
    speechrecords = soup.find_all('speechrecord')
    for i, record in enumerate(speechrecords):
      if record.speaker.text == '会議録情報':
        continue
      # 内閣総理大臣臨時代理・厚生大臣, 副議長 etc.
      if record.speakerposition.text.startswith('内閣総理大臣')
              and i != 0 and not speechrecords[i-1].speakerposition.text.endswith('議長'):
        messages = [{'role': 'system', 'content': 'あなたは総理大臣です。'}]
        for j in reversed(range(2)):
          if len(speechrecords[i-j].speech.text) > 200:
            break
          role = ['assistant', 'user'][j%2]
          text = '　'.join(speechrecords[i-j].speech.text.split('　')[1:]) # remove speaker name
          text = p.sub('', text) # remove actions such as〔賛成者起立〕
          text = ''.join(text.split()).replace('――――◇―――――', '') # remove line breaks and etc.
          messages.append({"role": role, "content": text})
        if len(messages) == 3:
          file.write(json.dumps({"messages": messages}, ensure_ascii=False) + '\n')

This file contains 191 lines of data like the following. At least 10 lines are required.

{"messages": [{"role": "system", "content": "あなたは総理大臣です。"}, {"role": "user", "content": "れいわ新選組、大阪五区、大石あきこです。先日、大阪のカジノの是非を問う住民投票署名が二十万筆を超え、法定数を大きく上回りました。住民合意なき国のカジノ認可はあり得ないと、冒頭申し上げます。さて、岸田総理、先週の予算委員会で、消費税は減税しないと明言されていました。もう一回確認します。消費税は減税しないでよろしいんですね。イエスかノーかでお答えください。"}, {"role": "assistant", "content": "消費税減税は考えておりません。社会保障の安定財源として重要である、また、システムの変更等を伴うことを考えますと、コストあるいは準備期間、大きなものが必要になります。機動的に対応することは不可能であると思っています。"}]}
{"messages": [{"role": "system", "content": "あなたは総理大臣です。"}, {"role": "user", "content": "では、換気している広いオフィスに同僚と二人、話さずにパソコン作業しています。このときはマスクを外してよろしいですか。"}, {"role": "assistant", "content": "屋内においては、人との距離二メートル以上を目安が確保でき、かつ会話をほとんど行わない場合、マスクを着用する必要はないという考え方を政府は示しています。委員の今の例についても、今の二つの条件共にクリアしているケースではないかと理解をいたします。"}]}
{"messages": [{"role": "system", "content": "あなたは総理大臣です。"}, {"role": "user", "content": "念のために伺いますが、岸田総理、共同声明の中に、日銀が金融緩和を進める進めない、金融緩和という文言があるか、あるいは、それ進めるかという表現があるか御存じですか。"}, {"role": "assistant", "content": "御質問は、金融緩和という言葉があるかという御質問でありましたが、今、金融緩和という言葉そのものが共同声明の中にあるかどうか今ちょっと確認をしているところですが、それについては今はちょっと確認はできておりません。"}]}

Fine-tuning

Upload the file and call the API for fine-tuning with OpenAI Python Library.

import openai
import time

with open("fine_tune.jsonl", "r") as file:
  upload = openai.File.create(
    purpose="fine-tune",
    file=file,
  )

while upload.status != "processed": # uploaded, processed, pending, error, deleting or deleted
  time.sleep(5)
  upload = openai.File.retrieve(id=upload.id)
  print(f"file status: {upload.status}")

job = openai.FineTuningJob.create(
  model="gpt-3.5-turbo",
  suffix='souri',
  training_file=upload.id
)

while job.status != "succeeded": # created, running, succeeded, failed, or cancelled
  time.sleep(10)
  job = openai.FineTuningJob.retrieve(id=job.id)
  print(f"fine-tunes job status: {job.status}")

After the job is completed, you will receive an email, but you can also check the status of the job with the CLI. It took about 22 minutes.

$ openai api fine_tuning.job.list                                                    
{
  "object": "fine_tuning.job",
  "id": "ftjob-SXe8mDlviNBoA89JbeVKJHBi",
  ... 
  "status": "succeeded",
  "validation_file": null,
  "training_file": "file-wRmOG1oE8SEapJnB0cnjY0K5",
  "hyperparameters": {
    "n_epochs": 3
  },
  "trained_tokens": 137028,
  "error": null
}
...

$ openai api fine_tuning.job.events -i ftjob-SXe8mDlviNBoA89JbeVKJHBi | jq ".message"
"The job has successfully completed"
"New fine-tuned model created: ft:gpt-3.5-turbo-0613:personal:souri:7xTn9E3u"
"Step 570/573: training loss=1.28"
"Step 560/573: training loss=0.65"
"Step 550/573: training loss=1.19"
...

Chat conversation

You can call the API, but Playground is convenient for trying out models.

First, I started a conversation with the original model with a “You are the Prime Minister” system prompt, but it answered the general questions as usual.

Whereas, fine-tuned model successfully answered like a politician. Short sentences were returned, probably because the length of dataset sentences was limited.

OpenAI Evals can be used to evaluate the model.