GMB, Gemma Sprint

2024-10-01 2 분 소요

[Notification]
Google Machine Learning Bootcamp 과정에서 Gemma Sprint을 진행했습니다. 제가 SAA라는 프로젝트를 진행했던 과정을 공유하여 LLM을 처음 사용하시는 분들에게 도움이 되거나, 프로젝트를 응용하여 더 좋은 결과물이 나오길 기대합니다.

SAA : SQL Auxiliary Agent

1. Project Description

해당 프로젝트는 최근 출시된 Gemma2 모델을 Fine Tuning하여 Text to SQL에 응용할 수 있도록 진행한 프로젝트 입니다.

입력은 자연어 질의와 context로 이루어져 있으며, 질의와 context는 다음을 의미합니다.

자연어 : SQL query가 아닌 테이블에서 얻고 싶은 정보를 자연어로 표현한 문장
- Ex. How many employees are in the employee information table?

context : 대상이 되는 Tabel의 구조 정보

Ex.

  Create table If Not Exists Products (product_id int, low_fats ENUM('Y', 'N'), recyclable ENUM('Y','N'))
  Truncate table Products
  insert into Products (product_id, low_fats, recyclable) values ('0', 'Y', 'N')
  insert into Products (product_id, low_fats, recyclable) values ('1', 'Y', 'Y')
  insert into Products (product_id, low_fats, recyclable) values ('2', 'N', 'Y')
  insert into Products (product_id, low_fats, recyclable) values ('3', 'Y', 'Y')
  insert into Products (product_id, low_fats, recyclable) values ('4', 'N', 'N')          

이러한 정보를 제공했을 때, 이에 알맞는 SQL query를 Return하는 모델이라고 봐주시면 될 것 같습니다!

2. Data

gretelai/synthetic_text_to_sql
해당 데이터에서 sql_prompt, sql_context, sql 컬럼을 파싱해서 사용했습니다.
제가 사용한 데이터의 예시는 아래와 같습니다.

sql_prompt	sql	sql_context
What is the total volume of timber sold by each salesperson, sorted by salesperson?	SELECT salesperson_id, name, SUM(volume) as total_volume FROM timber_sales JOIN salesperson ON timber_sales.salesperson_id = salesperson.salesperson_id GROUP BY salesperson_id, name ORDER BY total_volume DESC;	CREATE TABLE salesperson (salesperson_id INT, name TEXT, region TEXT); INSERT INTO salesperson (salesperson_id, name, region) VALUES (1, ‘John Doe’, ‘North’), (2, ‘Jane Smith’, ‘South’); CREATE TABLE timber_sales (sales_id INT, salesperson_id INT, volume REAL, sale_date DATE); INSERT INTO timber_sales (sales_id, salesperson_id, volume, sale_date) VALUES (1, 1, 120, ‘2021-01-01’), (2, 1, 150, ‘2021-02-01’), (3, 2, 180, ‘2021-01-01’);
List all the unique equipment types and their corresponding total maintenance frequency from the equipment_maintenance table.	SELECT equipment_type, SUM(maintenance_frequency) AS total_maintenance_frequency FROM equipment_maintenance GROUP BY equipment_type;	CREATE TABLE equipment_maintenance (equipment_type VARCHAR(255), maintenance_frequency INT);
How many marine species are found in the Southern Ocean?	SELECT COUNT(*) FROM marine_species WHERE location = ‘Southern Ocean’;	CREATE TABLE marine_species (name VARCHAR(50), common_name VARCHAR(50), location VARCHAR(50));

3. Model

Base Model은 Gemma2를 활용해 Fine Tuning을 진행했습니다.
개개인의 하드웨어 사양이 다르기에, 4가지 버전으로 파인튜닝을 진행했고 각 모델은 아래의 링크에서 확인하실 수 있습니다.
SAA Model
Merge Version은 Fine Tuning 이후 Gemma2 모델과 Merge하여 일반화 성능을 향상시킨 모델입니다.

4. How to Use?

Fine Tuning 및 Inference 가이드 링크를 첨부드립니다.
Fine Tuning Guide
- Run Enviorment : Kaggle Notebook / GPU T4 X 2
- Train Report :
Inference
- Run Enviorment : Colab / T4 GPU
github

5. Review

학부연구생 시절 주로 Computer Vision, Time-Series Forecasting과 관련한 연구를 수행해왔기에, LLM과 관련하여 관심은 있었지만 쉽사리 접근하지 못해 아쉬움이 많았습니다.
이번 Sprint를 계기로 기본적인 틀을 이해하고 LLM과 관련한 기술들을 함양할 수 있었던 것 같습니다.
다만, 현재 Kaggle과 Colab GPU 사양을 모두 소모해버려서 기존 계획했던 내용을 모두 소화하지 못해 아쉬움이 남습니다.
- LangChain을 활용해 MySQL DB와 Connection 후 쿼리를 적용하여 실제 결과를 Return해주는 Agent System
- Grdio or Stream-lit을 활용한 App 형태 배포
하드웨어 자원이 돌아오고, 예정된 코딩테스트 및 인적성 시험이 끝나면 계획했던 내용을 추가적으로 반영해 Blog에 포스팅 해보겠습니다.

Twitter Facebook LinkedIn

GMB, Gemma Sprint

SAA : SQL Auxiliary Agent

1. Project Description

2. Data

3. Model

4. How to Use?

5. Review

공유하기

댓글남기기

참고

[Lightweight-Sereis] 2. Quantization

[Lightweight-Sereis] 1. Pruning

[Lightweight-Sereis] 0. Intro

ResNet-50