Function Calling Evaluation Tool for LLMs
Supports OpenRouter (Cloud) and Ollama (Local) backends
Installation • Quick Start • Usage • Ollama Setup • Methodology
FC-Eval is a comprehensive CLI tool for evaluating Large Language Models' function-calling capabilities. Inspired by the Berkeley Function Calling Leaderboard (BFCL) v4 methodology, it provides rigorous testing across 30 unique test cases covering single-turn, multi-turn, and agentic scenarios.
Key Features:
- 🌐 Dual Backend Support: Evaluate models via OpenRouter (cloud) or Ollama (local)
- 📊 30 Unique Test Cases: Comprehensive coverage across all function-calling scenarios
- 🔄 Best of N Trials: Configurable trial count with reliability metrics
- ⚡ Parallel Execution: Multi-threaded evaluation for faster results
- 📈 Comprehensive Reporting: JSON and TXT reports with detailed metrics
- 🎯 AST-Based Validation: Accurate function call matching using abstract syntax trees
- Python 3.10 or higher
- For Ollama testing: Linux/macOS/Windows with WSL
git clone https://github.com/gauravvij/function-calling-cli.git
cd function-calling-cli

# Create a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install the package
pip install -e .

FC-Eval can be run in two ways:
- Using the installed CLI (fc-eval): supports both OpenRouter and Ollama
- Using the standalone script (evaluate_fc.py): OpenRouter only
1. Get an API key at https://openrouter.ai/keys

2. Set your API key:

   export OPENROUTER_API_KEY="your-api-key-here"

3. Run the evaluation using fc-eval:

   fc-eval --provider openrouter --models qwen/qwen3.5-9b

   Or using the standalone script:

   python evaluate_fc.py --models qwen/qwen3.5-9b
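Under the hood, OpenRouter exposes an OpenAI-compatible chat completions API, so a function-calling evaluation request looks roughly like the sketch below. This is illustrative only, not FC-Eval's internal code; the `get_weather` tool is a made-up example.

```python
# Sketch of an OpenAI-compatible tool-calling request body as accepted by
# OpenRouter's chat completions endpoint. The get_weather tool is a
# hypothetical example, not one of FC-Eval's actual test cases.
def build_request(model: str, user_message: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "Get the current weather for a city",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
    }

payload = build_request("qwen/qwen3.5-9b", "What's the weather in Paris?")
```

The model's reply then carries a `tool_calls` entry that the evaluator can compare against the expected call.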
1. Install Ollama (see the Ollama Setup section)

2. Create the optimized model:

   ollama create qwen3.5:9b-fc -f qwen3.5-9b-fc.modelfile

3. Run the evaluation:

   fc-eval --provider ollama --models qwen3.5:9b-fc
Ollama provides a simple installation script for Linux/macOS:
# Install Ollama (official one-liner)
curl -fsSL https://ollama.com/install.sh | sh

This will:
- Download and install the Ollama binary
- Set up the Ollama service
- Start the Ollama server automatically
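In addition to the curl checks, the server can be probed programmatically. The sketch below (not part of FC-Eval) hits the standard `/api/tags` endpoint and parses its `models` list; the sample payload at the end lets it run offline.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"

def list_local_models(tags_payload: dict) -> list[str]:
    """Extract model names from an Ollama /api/tags response payload."""
    return [m["name"] for m in tags_payload.get("models", [])]

def server_is_up(url: str = OLLAMA_URL) -> bool:
    """Return True if the Ollama server answers on /api/tags."""
    try:
        with urllib.request.urlopen(f"{url}/api/tags", timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

# Offline demonstration with a sample /api/tags payload:
sample = {"models": [{"name": "qwen3.5:9b-fc"}, {"name": "llama3.2:latest"}]}
print(list_local_models(sample))  # → ['qwen3.5:9b-fc', 'llama3.2:latest']
```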
# Check Ollama is installed
ollama --version
# Verify server is running
curl http://localhost:11434/api/tags

The project includes an optimized Modelfile (qwen3.5-9b-fc.modelfile) that addresses the temperature and system prompt issues identified in our analysis:
FROM qwen3.5:9b
# System prompt optimized for function calling
SYSTEM You are a helpful AI assistant with access to tools/functions. When you need to perform an action, use the available tools by making function calls. Always respond with the correct function call format when a tool is needed.
# Critical parameters for function calling accuracy
PARAMETER temperature 0.0
PARAMETER top_p 0.9
PARAMETER top_k 10
PARAMETER num_ctx 8192
PARAMETER num_predict 4096

Key Configuration Changes:
| Parameter | Default | Optimized | Impact |
|---|---|---|---|
| `temperature` | 1.0 | 0.0 | Eliminates randomness for deterministic function calls |
| `top_p` | 0.95 | 0.9 | Slightly more focused sampling |
| `top_k` | 20 | 10 | Reduces token selection variety |
| `num_ctx` | 2048 | 8192 | Larger context window |
| `num_predict` | -1 | 4096 | Maximum response length |
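The same parameters can also be supplied per request instead of baking them into a Modelfile: Ollama's `/api/chat` (and `/api/generate`) endpoints accept an `options` object that overrides the model's defaults for that call. A minimal sketch of such a request body:

```python
# Sketch: pass the Modelfile parameters per request via the "options"
# field of Ollama's /api/chat endpoint instead of a custom Modelfile.
def chat_payload(model: str, prompt: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {
            "temperature": 0.0,   # deterministic function calls
            "top_p": 0.9,         # slightly more focused sampling
            "top_k": 10,          # reduce token selection variety
            "num_ctx": 8192,      # larger context window
            "num_predict": 4096,  # cap on response length
        },
    }

payload = chat_payload("qwen3.5:9b", "Book a table for two at 7pm.")
```

This is convenient for experiments; the Modelfile approach is better when you want every client of the model to get the same defaults.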
# Create the custom model from the Modelfile
ollama create qwen3.5:9b-fc -f qwen3.5-9b-fc.modelfile
# Verify the model was created
ollama list
# Inspect model parameters
ollama show qwen3.5:9b-fc

If you don't have the base model:
# Pull the base Qwen 3.5 9B model
ollama pull qwen3.5:9b
# Then create the custom version
ollama create qwen3.5:9b-fc -f qwen3.5-9b-fc.modelfile

FC-Eval requires an OpenRouter API key for cloud-based evaluation.
Option 1: Environment Variable (Recommended)
export OPENROUTER_API_KEY="your-api-key-here"

Add this to your ~/.bashrc or ~/.zshrc for persistence.
Option 2: Command Line Argument
fc-eval --provider openrouter --api-key "your-api-key-here"

Option 3: .env File
Create a .env file in your working directory:
OPENROUTER_API_KEY=your-api-key-here
Get your API key at: https://openrouter.ai/keys
No API key required for Ollama. Ensure the server is running:
# Check if Ollama is running
curl http://localhost:11434/api/tags

# Evaluate default models via OpenRouter
fc-eval --provider openrouter
# Evaluate specific models
fc-eval --provider openrouter --models qwen/qwen3.5-9b qwen/qwen3.5-27b
# Run with parallel execution
fc-eval --provider openrouter --mode parallel --max-workers 10

# Evaluate local Ollama models
fc-eval --provider ollama
# Evaluate specific local model
fc-eval --provider ollama --models qwen3.5:9b-fc
# Run with sequential mode (recommended for local testing)
fc-eval --provider ollama --mode sequential

Parallel Execution (recommended for cloud):
fc-eval --provider openrouter --mode parallel --max-workers 10

Sequential Execution (recommended for local/debugging):
fc-eval --provider ollama --mode sequential

Evaluate specific models:
# OpenRouter models
fc-eval --provider openrouter --models openai/gpt-4o anthropic/claude-3.5-sonnet
# Ollama models
fc-eval --provider ollama --models llama3.2 mistral

Run multiple trials per test for reliability metrics (default: 3):
fc-eval --provider openrouter --trials 5

A test passes if at least one trial succeeds (Best of N logic). Reliability is reported as the percentage of trials that passed.
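The Best of N scoring described above amounts to a couple of lines. A minimal sketch (illustrative, not FC-Eval's exact implementation):

```python
# Best of N scoring: a test passes if any trial succeeds; reliability is
# the percentage of trials that passed.
def score_test(trials: list[bool]) -> tuple[bool, float]:
    passed = any(trials)
    reliability = 100.0 * sum(trials) / len(trials) if trials else 0.0
    return passed, reliability

passed, reliability = score_test([True, False, True])
print(passed, round(reliability, 1))  # → True 66.7
```

So with 3 trials, 2 passes counts as a pass overall but is reported at 66.7% reliability, matching the example in the methodology section.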
Run only specific test categories:
# Single-turn tests only
fc-eval --provider openrouter --category single_turn
# Multi-turn tests only
fc-eval --provider openrouter --category multi_turn
# Agentic tests only
fc-eval --provider openrouter --category agentic

Save reports to a custom directory:
fc-eval --provider openrouter --output-dir ./my_results

- Dual Backend Support: Test models via OpenRouter (cloud) or Ollama (local)
- 30 Unique Test Cases: Comprehensive coverage across single-turn, multi-turn, and agentic scenarios
- Best of N Trials: Configurable trial count with reliability metrics
- Parallel Execution: Multi-threaded evaluation for faster results
- Comprehensive Reporting: JSON and TXT reports with detailed metrics
- AST-Based Validation: Accurate function call matching using abstract syntax trees
- Category Breakdown: Detailed analysis by test category and subcategory
- Latency Tracking: Performance metrics for each model
- Single-Turn (16 tests)
  - Simple function calls
  - Multiple function selection
  - Parallel function calling
  - Parallel multiple functions
  - Relevance detection
- Multi-Turn (8 tests)
  - Base multi-turn conversations
  - Missing parameter handling
  - Missing function scenarios
  - Long context management
- Agentic (6 tests)
  - Web search simulation
  - Memory/state management
  - Format sensitivity
- Best of N: A test passes if at least one of N trials succeeds
- Reliability: Percentage of trials that passed (e.g., 2/3 trials = 66.7% reliability)
- AST Matching: Function calls validated using abstract syntax tree comparison
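The AST matching idea can be sketched with Python's standard `ast` module: parse each call expression and compare the function name and arguments structurally, so formatting differences (whitespace, argument order for keywords) don't cause false mismatches. This is an illustrative sketch, not FC-Eval's exact implementation.

```python
import ast

# Illustrative AST-based call matching: normalize each call expression
# into (function, positional args, sorted keyword args) and compare.
def calls_match(expected: str, actual: str) -> bool:
    def normalize(src: str):
        call = ast.parse(src, mode="eval").body
        if not isinstance(call, ast.Call):
            raise ValueError("not a function call")
        name = ast.dump(call.func)
        args = [ast.dump(a) for a in call.args]
        kwargs = sorted((kw.arg, ast.dump(kw.value)) for kw in call.keywords)
        return (name, args, kwargs)
    return normalize(expected) == normalize(actual)

print(calls_match("get_weather(city='Paris')",
                  "get_weather( city = 'Paris' )"))  # → True
```

String comparison would reject the second call above because of spacing; structural comparison accepts it while still rejecting any change to the function name or argument values.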
Problem: Connection refused error when using Ollama provider
Solution:
# Check if Ollama is running
curl http://localhost:11434/api/tags
# If not running, start the server
ollama serve

Problem: model not found error
Solution:
# List available models
ollama list
# Pull the required model
ollama pull qwen3.5:9b
# Create custom model with Modelfile
ollama create qwen3.5:9b-fc -f qwen3.5-9b-fc.modelfile

Problem: 401 Unauthorized or 429 Rate Limited
Solution:
# Verify API key is set
echo $OPENROUTER_API_KEY
# Set API key
export OPENROUTER_API_KEY="your-key-here"
# For rate limits, use sequential mode with fewer workers
fc-eval --provider openrouter --mode sequential --trials 1

Problem: Local Ollama models show significantly lower accuracy than OpenRouter
Explanation: This is expected due to:
- Quantization: Ollama uses Q4_K_M (4-bit) quantization by default
- System Prompts: OpenRouter may apply additional optimizations
- API Optimizations: Cloud providers may use response format enforcement
Recommendation: Use the custom Modelfile (qwen3.5-9b-fc.modelfile) for the best local results, but still expect a gap of roughly 60 percentage points versus OpenRouter.
Based on our analysis with Qwen 3.5 9B:
| Metric | OpenRouter (Cloud) | Ollama (Local) | Difference |
|---|---|---|---|
| Accuracy | 83.3% | 22.2% | -61.1 pp |
| Temperature | 0.0 (default) | 1.0 (default) | Critical |
| Avg Latency | ~1600ms | ~8900ms | 5.5x slower |
| Quantization | Unknown (likely F16) | Q4_K_M (4-bit) | Precision loss |
Recommendation: Use OpenRouter for production function-calling tasks requiring high accuracy. Use Ollama for local development, privacy-sensitive applications, or offline scenarios with acceptable accuracy trade-offs.
| File | Description |
|---|---|
| `evaluate_fc.py` | Main evaluation script |
| `qwen3.5-9b-fc.modelfile` | Optimized Ollama Modelfile for function calling |
| `FUNCTION_CALLING_ACCURACY_ANALYSIS.md` | Detailed discrepancy analysis report |
| `results/` | Directory containing evaluation reports |
MIT License - see LICENSE file for details.
Built with ❤️ by NEO
NEO - A fully autonomous AI Engineer