CS189 Assignment 2#

Project Description#

This lab is mainly about prompt engineering: we analyze the content and distribution of prompts, and examine how changing a prompt affects the quality of a model's answers.

LMArena#

LMArena is a head-to-head battle platform for large language models. A user asks a question, two anonymous models each give an answer, and the user votes for whichever answered better; only after voting are the two models' identities revealed.

Hugging Face#

Hugging Face is an AI community hosting many open-source models and datasets. In this project we fetch some datasets from it, and we also use some of its visualization tools (e.g. Gradio).

Remember to register a Hugging Face account, since you must be logged in to download the data.

Downloading the Data#

Just run the provided code.
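
For reference, here is a minimal sketch of what such a download typically looks like; the dataset id below is a placeholder, not necessarily the one the assignment uses:

from huggingface_hub import login
from datasets import load_dataset

login()  # paste your (free) Hugging Face access token when prompted

# Placeholder dataset id: substitute the one given in the assignment notebook
battles = load_dataset("lmsys/chatbot_arena_conversations", split="train").to_pandas()
python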

Data Description#

Field Analysis#

The data we get are model battle records: each row describes one event in which a user posed a question to two models and decided which one won.

print("Columns:\n*", "\n* ".join(battles.columns))
python

The most important columns are model_a, model_b, conversation_a, conversation_b, and winner:

- model_a and model_b are the names of the two models
- conversation_a and conversation_b contain the user's question and each model's answer
- winner records which model the user judged to have answered better

example = battles.iloc[4]
print(f"Conversation A (model = {example['model_a']}):")
print(example['conversation_a'])
print(f"Conversation B (model = {example['model_b']}):")
print(example['conversation_b'])
print("Category Tag:")
print(example['category_tag'])
python
Conversation A (model = mixtral-8x22b-instruct-v0.1):
[{'content': 'Which number id bigger 9.11 or 9.9 ?', 'num_tokens': 14, 'role': 'user'}
 {'content': 'The number 9.9 is bigger than 9.11. In decimal numbers, we compare them digit by digit from left to right. Both numbers start with the same digit, 9. However, the second digit of 9.9 is also 9, while the second digit of 9.11 is 1. Since 9 is greater than 1, 9.9 is greater than 9.11.', 'num_tokens': 89, 'role': 'assistant'}]
Conversation B (model = llama-3.1-70b-instruct):
[{'content': 'Which number id bigger 9.11 or 9.9 ?', 'num_tokens': 14, 'role': 'user'}
 {'content': '**Comparing the Numbers**\n\nTo determine which number is bigger, we can compare the numbers 9.11 and 9.9.\n\n**Step-by-Step Comparison**\n\n1. Both numbers have the same integer part: 9.\n2. We will compare the decimal parts: 0.11 and 0.9.\n3. Since 0.9 is greater than 0.11, we can conclude that 9.9 is greater than 9.11.\n\n**Result**\n\n9.9 is bigger than 9.11.', 'num_tokens': 112, 'role': 'assistant'}]
Category Tag:
{'criteria_v0.1': {'complexity': True, 'creativity': False, 'domain_knowledge': True, 'problem_solving': False, 'real_world': True, 'specificity': True, 'technical_accuracy': True}, 'if_v0.1': {'if': False, 'score': 1.0}, 'math_v0.1': {'math': True}}
plaintext

Note that each conversation is a nested list of dicts: the first dict is the user's question and the second dict is the model's answer.
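
A minimal accessor sketch, reusing the example row from above:

conv = example['conversation_a']    # list of role-tagged dicts
user_prompt = conv[0]['content']    # first dict: role == 'user'
model_reply = conv[1]['content']    # second dict: role == 'assistant'
print(user_prompt, "->", model_reply[:60])
python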

Plot the distribution of winner:

battles.winner.hist(title="Counts of Battle Outcomes", text_auto=True)
python
[Plot: Counts of Battle Outcomes]

Model Counts#

We want to see how many "battles" each model participated in. The model names only appear in the model_a and model_b columns, so we just concat those two columns and count.

fig = pd.concat([battles["model_a"], battles["model_b"]]).value_counts().plot.bar(title="Battle Count for Each Model", text_auto=True)
fig.update_layout(xaxis_title="Model", yaxis_title="Battle Count", height=400, showlegend=False)
fig
python
[Plot: Battle Count for Each Model]

Well-known models (e.g. Claude 3.5, Gemini, GPT, Llama) appear most often; I had barely heard of the models at the bottom of the count ranking.

Problem 1a#

Pick out the 20 models with the most battles; we can reuse the code above directly.

# Reuse the concat-and-count approach, then take the 20 most frequent models
selected_models = (
    pd.concat([battles["model_a"], battles["model_b"]])
    .value_counts()
    .head(20)
    .index
    .tolist()
)
python

Problem 1b#

Filter so that only battles between the TOP-20 models remain, and additionally drop the ties.

A boolean mask Series makes this easy to handle.

from typing import Tuple, Set
import pandas as pd

def subselect_battles(
    battles: pd.DataFrame, 
    selected_models: Set[str]
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    # Filters the battles DataFrame to only include battles between the selected models.
    # Returns a tuple of the dataframe filtered by models and the dataframe filtered by models with ties removed.
    cond=(battles['model_a'].isin(selected_models) & battles['model_b'].isin(selected_models))
    selected_battles=battles[cond]
    selected_battles_no_ties=selected_battles[~selected_battles['winner'].astype(str).str.contains('tie')]
    return selected_battles, selected_battles_no_ties
  
selected_battles, selected_battles_no_ties = subselect_battles(battles, selected_models)
python

The course provides a visualize_battle_count function that draws heatmaps; let's look at some examples.

Draw the heatmap of battle counts between the TOP-20 models:

visualize_battle_count(selected_battles, title="Battle Count of Each Combination of Models", show_num_models=30)
python
[Plot: Battle Count of Each Combination of Models]

The most frequent matchups are among Claude, Gemini, and GPT, which as far as I can tell are also the three most popular model families.

Now look at the heatmap drawn from the battles with ties removed:

visualize_battle_count(selected_battles_no_ties, "Battle Count for Each Combination of Models (without Ties)")
python
[Plot: Battle Count for Each Combination of Models (without Ties)]

The distribution is essentially unchanged.

Problem 1c#

First, an interesting question: why does the heatmap above show far more battles between strong models (e.g. GPT vs. Claude) than between strong and weak models?

Q: Why might LMArena pair strong models vs. strong models more often than strong vs. smaller models? 
Think about statistical power, how quickly you can differentiate models, and ranking uncertainty among top-tier systems.
plaintext
YOUR ANSWER: People tend to use strong models to test other models. Meanwhile, we can easily predict that a strong model will beat a weak one, so those battles carry little information; what is hard is differentiating the strong models from one another.
plaintext

Next, count the languages of the questions. The questions in the data are written in many languages, though it is easy to guess that English dominates.

The language field is accessed via battles['language'].

import plotly.express as px

lang_counts_all = battles["language"].value_counts()

fig_lang_all = px.bar(
    lang_counts_all,
    title="Distribution of Languages",
    text_auto=True,
    height=400
)
fig_lang_all.update_layout(
    xaxis_title="Language",
    yaxis_title="Count",
    showlegend=False
)
fig_lang_all.show()
python
[Plot: Distribution of Languages]

What surprised me is that Russian outnumbers Chinese; I wonder whether this dataset deliberately collected fewer Chinese conversations.

Next, tabulate the turn field, which records how many rounds of questions the user asked: turn = 1 means the user asked a single question; turn = 2 means the user asked, the model answered, and the user asked again; and so on.

fig = px.histogram(battles["turn"],
             title=f"Number of Conversation Turns",
             text_auto=True, height=400, log_y = True)
fig.update_layout(xaxis_title="Turns", yaxis_title="Count", showlegend=False)
fig.update_traces(marker_line_color='black', marker_line_width=1)
fig
python
[Plot: Number of Conversation Turns]

Most conversations consist of a single turn.

Gradio#

Gradio is a visualization tool integrated with Hugging Face Spaces; it lets you quickly build a demo of a model's behavior and deploy it on HF.

Problem 2a#

We need to compute, for every (model_a, model_b) pair, the total number of battles and the number of battles model_a won.

A single pass over the data suffices. Prepare two dicts: total_counts records {(model_a, model_b): battle_count}, and win_counts records {(model_a, model_b): number of times model_a won}.

Note the row_beats_col matrix: each entry has model_a as its row and model_b as its column, and win_counts[(model_a, model_b)] / total_counts[(model_a, model_b)] gives the fraction of all a-vs-b battles that a won.

Obviously the diagonal entries should be np.nan, since the dataset contains no battles of a model against itself.
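
A minimal sketch of that single pass; the function name and skeleton are mine, not the course's, and I assume the matrix is indexed by selected_models:

import numpy as np

def compute_row_beats_col(battles, models):
    # One pass: count battles and model_a wins per ordered (model_a, model_b) pair
    total_counts, win_counts = {}, {}
    for _, row in battles.iterrows():
        pair = (row["model_a"], row["model_b"])
        total_counts[pair] = total_counts.get(pair, 0) + 1
        if row["winner"] == "model_a":
            win_counts[pair] = win_counts.get(pair, 0) + 1

    # Fill the matrix; the diagonal stays np.nan because no model battles itself
    n = len(models)
    row_beats_col = np.full((n, n), np.nan)
    for i, a in enumerate(models):
        for j, b in enumerate(models):
            total = total_counts.get((a, b), 0)
            if total > 0:
                row_beats_col[i, j] = win_counts.get((a, b), 0) / total
    return row_beats_col

row_beats_col = compute_row_beats_col(selected_battles_no_ties, selected_models)
python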

Problem 2b#

Just run the provided plotting code; it draws the heatmap from the win-rate matrix we just built.

[Plot: win-rate heatmap (row model beats column model)]

The figure shows that weaker models win very rarely against stronger ones; for example, claude-3-5-sonnet-20240620 has only a 0.15 win rate against chatgpt-4o-latest, and even those wins may just be votes driven by users' obvious personal preferences.

Problem 2c#

As above, just run the provided plotting code to draw the average win rates.

To explain what average win rate means: since each entry is the win rate of the row model against the column model, we just take the mean of each row.

Of course, if you want to be fancy, you can instead subtract every entry from 1 to get the win rate of the column model against the row model, then take the mean of each column; the result should be the same.
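
A two-line sketch of both routes (np.nanmean skips the NaN diagonal; the two agree when the matrix satisfies row_beats_col[i, j] = 1 - row_beats_col[j, i]):

import numpy as np

# Mean of each row: average win rate of the row model against all others
average_win_rate = np.nanmean(row_beats_col, axis=1)

# The fancy alternative: column means of the complementary matrix
average_win_rate_alt = np.nanmean(1.0 - row_beats_col, axis=0)
python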

[Plot: Average Win Rate Against All Other Models]

GPT, gemini-pro, and a few other famous models have relatively high win rates; the figure matches common sense.

There are also a few questions to answer; the third one is the most interesting: why does win rate not necessarily reflect model strength? It is essentially a matchmaking (smurfing) effect: a merely decent model (say, Diamond or Emerald tier) that is mostly matched against Bronze and Silver opponents may post a higher win rate than Grandmaster-tier models that only face each other, while a Gold or Platinum model that keeps drawing Diamond and Master opponents will naturally have a low win rate, possibly lower than Bronze and Silver models fighting among themselves.

1. Identify at least two models that appear to have similar average win rates.
2. Compare the parameter sizes of the two models (you can google their parameter counts, and if a model is closed source and does not list its parameter count, assume it is >100 billion parameters). Is there a relationship between parameter size and performance? Are there any models which stick out as unusually good or bad for their size?
3. Why might computing win-rate in this manner misrepresent model strength? Think about the distribution of battles per model pair and how an imbalance in model pairing counts could result in one model ranking higher or lower than it should.
plaintext
YOUR ANSWER:
1. gpt-4o-mini-2024-07-18 and claude-3.5-sonnet-20240620
2. Both over 100B parameters
3. Some strong models may be paired mostly with weak models, inflating their win rate, while other models are paired mostly with strong opponents and end up with a low win rate, even though the latter may actually be the stronger models.
plaintext

Problem 3#

Now analyze the prompts. First extract the opening message (the user's question) from conversation_a, then filter to the rows satisfying two conditions:

- the language is English or unknown
- the model is in selected_models (the TOP 20 from before)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import time
import numpy as np
import pandas as pd
from sklearn.decomposition import TruncatedSVD

def first_user_text(conv):
    return (conv[0].get("content") or "").strip()

battles['prompt'] = battles['conversation_a'].apply(first_user_text).fillna("")
eng_battles = battles[(battles['language'] == 'English') | (battles['language'] == 'unknown')]
eng_battles, eng_battles_no_ties = subselect_battles(eng_battles, selected_models)
python

Some prompts repeat many times, e.g. "hi" and "hello". Let's first find the most frequent prompts and compute their share.

# Print the top 10 most common prompts along with their count and percentage of total prompts
top_prompts = eng_battles["prompt"].value_counts().head(10)
for i, (prompt, count) in enumerate(top_prompts.items(), 1):
    print(f"Rank {i}: {count} samples ({round(100 * count/len(eng_battles), 2)}%)\n{prompt}\n")

# print the total percentage of prompts that are 1 of the top 10 prompts
top_10_percentage = sum(top_prompts) / len(eng_battles)
print(f"Total percentage of prompts that are 1 of the top 10 prompts: {round(100 * top_10_percentage, 2)}%")
python

As expected, these are all conversational filler; the TOP-10 prompts account for 3.29% of all prompts.

Problem 3a#

Clearly these TOP-10 prompts are just filler phrases whose battle outcomes tell us little, so we should remove them.

# remove top prompts top_prompts from eng_battles_no_ties
eng_battles_no_ties_no_top_prompts = eng_battles_no_ties[~eng_battles_no_ties["prompt"].isin(top_prompts.index)]

pairwise_win_rate, fig = get_pairwise_win_fraction_plot(eng_battles_no_ties_no_top_prompts, title="Average Win Rate Against All Other Models (No Top Prompts)")
fig.show()
python
[Plot: Average Win Rate Against All Other Models (No Top Prompts)]

The ranking does change after removal; perhaps some models simply handled these phrases very well in pretraining, and dropping them makes those models slip.

Problem 3b#

The is_code column and the category_tag column carry extra information, essentially a set of conditions, each labeled True or False. We now want to compute, for each condition, the fraction of battles where it is True.

First, look at what these conditions look like:

eng_battles_no_ties['category_tag']
python
3         {'criteria_v0.1': {'complexity': False, 'creat...
31        {'criteria_v0.1': {'complexity': True, 'creati...
32        {'criteria_v0.1': {'complexity': False, 'creat...
38        {'criteria_v0.1': {'complexity': False, 'creat...
41        {'criteria_v0.1': {'complexity': False, 'creat...
                                ...                        
106077    {'criteria_v0.1': {'complexity': False, 'creat...
106091    {'criteria_v0.1': {'complexity': False, 'creat...
106092    {'criteria_v0.1': {'complexity': False, 'creat...
106113    {'criteria_v0.1': {'complexity': True, 'creati...
106128    {'criteria_v0.1': {'complexity': False, 'creat...
Name: category_tag, Length: 15277, dtype: object
plaintext

At the top level there is a key, here criteria_v0.1, whose value is a dict recording whether each condition is True or False. The assignment kindly converts each condition into a boolean Series for us:

# LMArena also provides more detailed category labels inside the columns `is_code`, `is_refusal`,
# and the nested `category_tag` column.
# We have already extracted the following boolean Series for you:

# GIVEN (do not modify)
expected_creative = eng_battles_no_ties['category_tag'].apply(lambda x: x['criteria_v0.1']['creativity'])
expected_tech = eng_battles_no_ties['category_tag'].apply(lambda x: x['criteria_v0.1']['technical_accuracy'])
expected_if = eng_battles_no_ties['category_tag'].apply(lambda x: x['if_v0.1']['if'])
expected_math = eng_battles_no_ties['category_tag'].apply(lambda x: x['math_v0.1']['math'])
expected_code = (eng_battles_no_ties['is_code'] == True)

# Task:
# 1) Make a bar chart showing these proportion for each category (i.e. the fraction of battles where the category is True)
# 2) Compute the pairwise win rate for each category and use the plot_category_rank_heatmap to visualize the results
python

First task: draw the chart. Since each condition is already a boolean Series, its mean is exactly the proportion of True; then assemble a two-column ['category', 'proportion'] DataFrame and plot it.
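
A minimal sketch under those assumptions (the display names in the dict are my own):

categories = {
    "creative": expected_creative,
    "technical": expected_tech,
    "instruction_following": expected_if,
    "math": expected_math,
    "code": expected_code,
}
# The mean of a boolean Series is exactly the fraction of True entries
proportions = pd.DataFrame({
    "category": list(categories),
    "proportion": [s.mean() for s in categories.values()],
})
fig = px.bar(proportions, x="category", y="proportion",
             title="Proportion of Battles in Each Category", text_auto=".2f")
fig.show()
python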

[Plot: Proportion of Battles in Each Category]

About 13% of the questions are about math and 22% are about coding.

Next, a per-category breakdown. We already have a win-rate matrix (row model vs. column model); now, for each category, filter to the battles of that category and compute the models' win-rate ranking on that subset.

If we filter nothing, call it the overall category; its win-rate matrix is exactly the one we computed earlier. The win rates are computed by get_pairwise_win_fraction_plot, as sketched below.
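
A hedged sketch of the per-category loop; the mask dictionary and titles are my own, and the collected matrices can then be fed to the course's plot_category_rank_heatmap:

category_masks = {
    "overall": None,
    "creative": expected_creative,
    "technical": expected_tech,
    "instruction_following": expected_if,
    "math": expected_math,
    "code": expected_code,
}
category_win_rates = {}
for name, mask in category_masks.items():
    subset = eng_battles_no_ties if mask is None else eng_battles_no_ties[mask]
    win_rate, _ = get_pairwise_win_fraction_plot(
        subset, title=f"Average Win Rate ({name})")
    category_win_rates[name] = win_rate
python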

[Plot: model ranks per category (plot_category_rank_heatmap)]

Model performance varies with question category: gemini-1.5-pro-exp-0801 is excellent overall but only ranks sixth on coding.

Problem 3c#

This looks like a lot of tasks, but really only three things need doing (sketched below):

- Loop over several values of K and run K-Means clustering
- Draw the elbow plot
- Use SVD to reduce the clustered data to 2 dimensions and draw a scatter plot

First, let's see what the provided clustering function outputs.

Then just search for K with a for loop; I found K = 12 here.
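
A minimal end-to-end sketch of the three steps; the vectorizer settings and the K range are my own choices, not the assignment's:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

prompts = eng_battles_no_ties["prompt"]
X = TfidfVectorizer(max_features=5000, stop_words="english").fit_transform(prompts)

# 1) Elbow plot: K-Means inertia over a range of K
ks = list(range(2, 21))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
plt.plot(ks, inertias, marker="o")
plt.xlabel("K"); plt.ylabel("Inertia"); plt.title("Elbow Plot")
plt.show()

# 2) Cluster at the chosen K (12 here), then 3) project to 2-D with SVD
labels = KMeans(n_clusters=12, n_init=10, random_state=0).fit_predict(X)
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=4, cmap="tab20")
plt.title("Prompt Clusters (2-D SVD Projection)")
plt.show()
python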

[Plots: Elbow Plot; Prompt Clusters (2-D SVD Projection)]

Problem 3d#

We only need to describe what we observe in the clusters; for instance, I found that cluster 0 consists mainly of code-related prompts.

In 2–3 sentences, use the labels from your clustering to do the following:
1. Briefly explain the type(s) of question you see in one of the distinct/similar clusters.
2. If you don’t see a clear pattern, describe likely limitations and potential improvements we can make.
plaintext
YOUR ANSWER:
I can see clearly that the prompts in cluster 0 mainly contain code.
plaintext

Problem 3e#

An open-ended problem: we are asked to find a filter whose application noticeably changes the model win rates. Here I simply chose the prompts that do not contain:

import pandas as pd
plaintext

and that's it (see the sketch below).
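
A hedged sketch of that filter, reusing the helpers from above:

# Keep only battles whose prompt lacks the literal string 'import pandas as pd'
mask = ~eng_battles_no_ties["prompt"].str.contains("import pandas as pd", regex=False)
_, fig = get_pairwise_win_fraction_plot(
    eng_battles_no_ties[mask],
    title="Average Win Rate (Prompts Without 'import pandas as pd')")
fig.show()
python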

[Plot: Average Win Rate Against All Other Models (filtered prompts)]
UC Berkeley CS189 Assignment 2 (Part 1)
https://astro-pure.js.org/blog/cs189_assignment2_part1
Author Ziyu(Albert) Li 李子煜
Published at February 4, 2026