Top Research Articles of the Week
Title: Task Contamination: Language Models May Not Be Few-Shot Anymore
Summary:
- Large language models (LLMs) can exhibit inflated zero-shot and few-shot performance due to task contamination, i.e., exposure to a task's examples during pre-training.
- LLMs perform significantly better on datasets released before their training data creation date than on datasets released after it, indicating the presence of task contamination.
- On classification tasks with no possibility of task contamination, LLMs rarely show statistically significant improvements over simple majority baselines.
- Training data inspection and task example extraction methods have low recall in detecting task contamination, highlighting the difficulty in identifying contaminated data.
- The increase in performance of LLMs over time is likely due to task contamination.
- Membership inference attacks can reveal the extent of task contamination in LLMs; the number of exactly matched generated examples correlates strongly with accuracy (a rough extraction-and-matching sketch follows this list).
- Caution should be exercised when using closed-source LLMs as baselines for zero-shot and few-shot evaluation, especially those trained with instruction fine-tuning or reinforcement learning from human feedback (RLHF).
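As a rough illustration of the extraction-and-matching check referenced above, the sketch below asks a model to reproduce training examples for a task and measures how many dataset instances it emits verbatim. The `generate` callable, prompt wording, and matching rule are assumptions for illustration, not the paper's exact protocol.

```python
# Sketch of a task-example extraction check for contamination (illustrative only).

def extract_candidate_examples(generate, task_name, n_samples=20):
    """Ask the model to reproduce training examples for a task.

    `generate` is any callable mapping a prompt string to generated text.
    """
    prompt = (
        f"List verbatim examples (input and label) from the {task_name} "
        "training set that you saw during training."
    )
    return [generate(prompt) for _ in range(n_samples)]

def exact_match_rate(generated_texts, dataset_inputs):
    """Fraction of dataset instances reproduced verbatim by the model."""
    generated_blob = "\n".join(generated_texts)
    hits = sum(1 for x in dataset_inputs if x.strip() and x.strip() in generated_blob)
    return hits / max(len(dataset_inputs), 1)

# A high exact-match rate suggests the task's data was seen during pre-training,
# which the paper links to inflated zero-shot and few-shot accuracy.
```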
Link
Title: Large Language Models as Traffic Signal Control Agents: Capacity and Opportunity
Summary:
- The paper introduces LLMLight, a framework that uses large language models (LLMs) for traffic signal control tasks.
- LLMLight leverages the generalization and zero-shot reasoning capabilities of LLMs to make human-like decisions for efficient traffic management (a minimal prompting sketch follows this list).
- LLMLight achieves state-of-the-art or competitive results across five real-world traffic datasets, showcasing its generalization, interpretability, and zero-shot reasoning abilities.
- The experiments show that RL-based baselines suffer from limited generalization and lack interpretability, whereas LLMLight performs consistently well without any training on transportation management tasks.
- The paper suggests future research directions in integrating LLMs with RL-based methods, multi-intersection traffic signal control, and developing LLMs with domain-specific expertise in traffic management.
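As a loose sketch of the prompted decision step described above (not LLMLight's actual prompt or interface), the snippet below summarizes queue lengths at an intersection, asks an LLM to reason and pick the next signal phase, and falls back to serving the longest queue when the reply is unparseable. The phase names and observation fields are assumptions.

```python
# Illustrative LLM-prompted signal-phase selection; not the framework's real API.

PHASES = ["NS-through", "EW-through", "NS-left", "EW-left"]

def build_prompt(queue_lengths):
    """Summarize the intersection state and ask for the next phase."""
    state = ", ".join(f"{phase}: {q} waiting vehicles"
                      for phase, q in zip(PHASES, queue_lengths))
    return (
        "You control a traffic signal. Current queues -> " + state + ". "
        "Reason step by step, then answer with exactly one phase name from: "
        + ", ".join(PHASES) + "."
    )

def choose_phase(generate, queue_lengths):
    """`generate` is any callable mapping a prompt to the LLM's reply."""
    reply = generate(build_prompt(queue_lengths))
    for phase in PHASES:
        if phase in reply:
            return phase
    # Fall back to the longest queue if the reply names no known phase.
    return PHASES[max(range(len(PHASES)), key=lambda i: queue_lengths[i])]
```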
Link
Title: Harnessing Pre-trained Generalist Agents for Software Engineering Tasks
Summary:
- Bug Detection in Blockmaze Game:
We fine-tuned the pre-trained generalist agents on the Blockmaze game and evaluated their performance in terms of bug detection. The results are shown in Table 6. The baseline agent detected an average of 1.8 bugs of Type 1 and 2 bugs of Type 2. Among the fine-tuned generalist agents, the MGDT-MAENT configuration detected an average of 1 bug of Type 2, while the MGDT-DQN configuration detected an average of 5 bugs of Type 1 and 5 bugs of Type 2. The MGDT-PPO configuration did not detect any bugs. Overall, the fine-tuned generalist agents showed comparable or better bug detection performance than the baseline agent (a minimal evaluation-loop sketch follows this summary).
- Bug Detection in MsPacman Game:
We also fine-tuned the pre-trained generalist agents on the MsPacman game and evaluated their bug detection performance. The results are shown in Table 7. The baseline agent detected an average of 772 bugs of Type 1, 689 bugs of Type 2, 675 bugs of Type 3, and 689 bugs of Type 4. Among the fine-tuned generalist agents, the MGDT-MAENT configuration detected an average of 200 bugs of Type 3 and 59 bugs of Type 4, while the MGDT-DQN configuration detected an average of 20 bugs of Type 3 and 11 bugs of Type 4. The MGDT-PPO configuration did not detect any bugs. Overall, the fine-tuned generalist agents showed comparable or better bug detection performance than the baseline agent.
- Makespan in PDR-based Scheduling:
For the PDR-based scheduling task, we fine-tuned the pre-trained generalist agents and evaluated their performance in terms of makespan. The results are shown in Table 8. The baseline agent achieved a makespan of 573 for the (6x6) instance and 2480 for the (30x20) instance. Among the fine-tuned generalist agents, the MGDT-MAENT configuration achieved a makespan of 391 for the (6x6) instance and 2019 for the (30x20) instance, while the MGDT-DQN configuration achieved a makespan of 392 for the (6x6) instance and 2012 for the (30x20) instance. The MGDT-PPO configuration achieved a makespan of 200 for the (30x20) instance. The IMPALA-V-trace and IMPALA-PPO configurations achieved makespans of 201 and 182 for the (6x6) instance, respectively. Overall, the fine-tuned generalist agents showed comparable or better makespan performance than the baseline agent.
- Cumulative Reward, Training Time, and Testing Time:
We also evaluated the fine-tuned generalist agents in terms of cumulative reward, training time, and testing time. The results are shown in Tables 9, 10, and 11.
For the MsPacman game, the baseline agent achieved an average cumulative reward of 8140. Among the fine-tuned generalist agents, the MGDT-MAENT configuration achieved an average cumulative reward of 200, the MGDT-DQN configuration achieved an average cumulative reward of 150, and the MGDT-PPO configuration achieved an average cumulative reward of 45. For the Blockmaze game, the baseline agent achieved an average cumulative reward of 43200. Among the fine-tuned generalist agents, the MGDT-MAENT configuration achieved an average cumulative reward of -297, the MGDT-DQN configuration achieved an average cumulative reward of -149,
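For context on the bug-detection evaluation above, here is a minimal evaluation-loop sketch under assumed interfaces: an instrumented, Gym-style game environment exposes per-bug oracle predicates, and we count how many distinct bugs a fine-tuned agent's trajectories trigger. The `env`, `policy`, and `bug_oracles` interfaces are illustrative, not the paper's harness.

```python
# Illustrative bug-detection evaluation loop; interfaces are assumptions.

def count_detected_bugs(env, policy, bug_oracles, episodes=10):
    """`bug_oracles` maps a bug id to a predicate over (state, action, next_state)."""
    detected = set()
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            # Classic Gym-style step signature assumed here.
            next_state, reward, done, info = env.step(action)
            for bug_id, triggered in bug_oracles.items():
                if triggered(state, action, next_state):
                    detected.add(bug_id)
            state = next_state
    return len(detected)
```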
Link
Title: A Prompt Learning Framework for Source Code Summarization
Summary:
- RQ4: Influence of training data size on PromptCS.
To investigate the impact of training data size on PromptCS, we conduct experiments with different proportions of the training set. Specifically, we randomly select 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, and 90% of the training set and use them to train PromptCS. The remaining 10% of the training set is used as the validation set. We use StarCoderBase-1B as the base LLM. Table 7 presents the experimental results (a brief sketch of this subsampling-and-scoring protocol follows this summary).
From Table 7, it can be observed that as the size of the training data increases, the performance of PromptCS generally improves. This is consistent across all four evaluation metrics. However, the improvement becomes less significant as the training data size increases. For example, when using 10% of the training data, the BLEU score is 19.33, and when using 90% of the training data, the BLEU score is 19.62. The difference in BLEU scores between 10% and 90% is less than 0.3. This suggests that PromptCS can achieve good performance even with a relatively small amount of training data.
Summary: Increasing the size of the training data generally improves the performance of PromptCS. However, the improvement becomes less significant as the training data size increases. PromptCS can achieve good performance even with a relatively small amount of training data.
- RQ5: Performance of PromptCS on code summarization tasks in other programming languages.
To evaluate the performance of PromptCS on code summarization tasks in other programming languages, we conduct experiments on three additional languages from the CSN corpus: Go, JavaScript, and PHP. We use the same experimental settings as before and report the results in Table 8.
From Table 8, it can be observed that PromptCS performs well on code summarization tasks in other programming languages. It achieves competitive results compared to the baselines in terms of all four evaluation metrics. This demonstrates the effectiveness and generalizability of PromptCS across different programming languages.
Summary: PromptCS performs well on code summarization tasks in other programming languages, achieving competitive results compared to the baselines. This demonstrates the effectiveness and generalizability of PromptCS across different programming languages.
- RQ6: Performance of PromptCS in human evaluation.
To evaluate the quality of the summaries generated by PromptCS, we conduct a human evaluation. We randomly select 100 code snippets from the CSN-Java dataset and generate summaries using PromptCS. We also generate summaries using the task-oriented fine-tuning scheme as a baseline for comparison. We then ask human evaluators, who are AI engineers, to rate the quality of the summaries on a scale from 1 to 5, with 1 being the lowest and 5 being the highest. We collect ratings for both PromptCS and the baseline, and calculate the average ratings for each. The results are shown in Table 9.
From Table 9, it can be observed that PromptCS outperforms the baseline in terms of human evaluation. The average ratings for PromptCS are higher than the average ratings for the baseline, indicating that the summaries generated by PromptCS are of higher quality. This further validates the effectiveness of PromptCS in generating high-quality code summaries.
Summary: PromptCS generates higher-quality code summaries compared to the task-oriented fine-tuning scheme, as indicated by the results of human evaluation.
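As a brief sketch of the RQ4 protocol above (training on growing fractions of the data and scoring with BLEU), the snippet below subsamples the training set and computes corpus BLEU with sacrebleu. `train_promptcs` and `summarize` are hypothetical stand-ins for the framework's training and inference entry points, and the BLEU variant may differ from the paper's.

```python
# Illustrative data-size study: subsample, train, and score with corpus BLEU.
import random
import sacrebleu

def subsample(train_set, fraction, seed=0):
    """Randomly select a fraction of the training examples."""
    rng = random.Random(seed)
    return rng.sample(train_set, int(len(train_set) * fraction))

def bleu_at_fraction(train_set, test_set, fraction, train_promptcs, summarize):
    subset = subsample(train_set, fraction)
    model = train_promptcs(subset)                 # hypothetical training entry point
    hypotheses = [summarize(model, code) for code, _ in test_set]
    references = [[ref for _, ref in test_set]]    # one reference stream
    return sacrebleu.corpus_bleu(hypotheses, references).score

# e.g. score the model at 10%, 20%, ..., 90% of the training data and compare.
```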
Link
Title: RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair
Summary:
- The paper presents RepairLLaMA, a novel program repair approach that combines repair-specific code representations with parameter-efficient fine-tuning of large language models (LLMs) to fix bugs.
- The experiments show that RepairLLaMA outperforms baselines and achieves state-of-the-art results on two benchmarks, correctly fixing 125 Defects4J bugs and 82 HumanEval-Java bugs.
- Fine-tuning LLMs with program repair specific code representations improves repair effectiveness.
- Parameter-efficient fine-tuning is more effective than full-parameter fine-tuning for program repair (a minimal LoRA adapter sketch follows this list).
- RepairLLaMA generates fewer candidate patches, making it more efficient to evaluate plausibility and correctness.
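As a sketch of the parameter-efficient fine-tuning idea highlighted above, the snippet below attaches LoRA adapters to a code LLM with Hugging Face's peft library; the base model name, target modules, and adapter ranks are illustrative assumptions rather than RepairLLaMA's actual configuration.

```python
# Illustrative LoRA setup with transformers + peft; hyperparameters are assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")
lora_config = LoraConfig(
    r=16,                                 # low-rank adapter dimension
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # only the adapter weights are trainable

# Fine-tune `model` on (buggy code representation -> fix) pairs with a standard
# causal-LM objective; at inference, generate candidate patches for validation.
```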
Link
Title: Making Large Language Models A Better Foundation For Dense Retrieval
Summary:
- The paper introduces LLaRA, a novel approach for adapting large language models (LLMs) for dense retrieval tasks.
- LLaRA consists of two pretext tasks: EBAE (Embedding-Based Auto-Encoding) and EBAR (Embedding-Based Auto-Regression), which adapt LLMs to generate text embeddings that capture the global semantic context (a minimal embedding-extraction sketch follows this list).
- LLaRA substantially improves the retrieval performance of LLMs on benchmark datasets like MSMARCO and BEIR.
- The adaptation process of LLaRA is simple, lightweight, and efficient; it requires no labeled data and can be applied directly on top of existing pretraining pipelines.
- LLaRA outperforms other baseline methods, including BERT-based retrievers, in terms of retrieval accuracy and generality across different scenarios.
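To illustrate the kind of LLM-based embedding LLaRA trains, here is a sketch under assumptions (the model name, last-token pooling, and scoring choice are not necessarily LLaRA's exact recipe): it derives a normalized dense vector from a decoder-only model's final hidden state and scores query-document relevance by dot product.

```python
# Illustrative dense-retrieval embedding from a decoder-only LLM.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # illustrative backbone choice
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

@torch.no_grad()
def embed(text):
    inputs = tokenizer(text, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state   # [1, seq_len, hidden_dim]
    vec = hidden[0, -1]                          # last-token representation
    return torch.nn.functional.normalize(vec, dim=-1)

def score(query, document):
    # Relevance as the dot product (cosine, since embeddings are normalized).
    return float(embed(query) @ embed(document))
```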
Link