Function Calling Evaluation Tool for LLMs
Supports OpenRouter (Cloud) and Ollama (Local) backends
Installation • Quick Start • Usage • Ollama Setup • Methodology
FC-Eval is a comprehensive CLI tool for evaluating Large Language Models' function-calling capabilities. Inspired by the Berkeley Function Calling Leaderboard (BFCL) v4 methodology, it provides rigorous testing across 30 unique test cases covering single-turn, multi-turn, and agentic scenarios.
Key Features:
- 🌐 Dual Backend Support: Evaluate models via OpenRouter (cloud) or Ollama (local)
- 📊 30 Unique Test Cases: Comprehensive coverage across all function-calling scenarios
- 🔄 Best of N Trials: Configurable trial count with reliability metrics
- ⚡ Parallel Execution: Multi-threaded evaluation for faster results
- 📈 Comprehensive Reporting: JSON and TXT reports with detailed metrics
- 🎯 AST-Based Validation: Accurate function call matching using abstract syntax trees
- Python 3.10 or higher
- For Ollama testing: Linux/macOS/Windows with WSL
git clone https://github.com/gauravvij/function-calling-cli.git
cd function-calling-cli

# Create a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install the package
pip install -e .

FC-Eval can be run in two ways:
- Using the installed CLI (fc-eval): supports both OpenRouter and Ollama
- Using the standalone script (evaluate_fc.py): OpenRouter only
1. Get an API key at https://openrouter.ai/keys

2. Set your API key:

   export OPENROUTER_API_KEY="your-api-key-here"

3. Run the evaluation using fc-eval:

   fc-eval --provider openrouter --models qwen/qwen3.5-9b

   Or using the standalone script:

   python evaluate_fc.py --models qwen/qwen3.5-9b
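Under the hood, OpenRouter exposes an OpenAI-compatible chat completions API, so a function-calling evaluation request looks roughly like the sketch below. This is illustrative only, not FC-Eval's internal code; the `get_weather` tool is a made-up example.

```python
# Sketch of an OpenAI-compatible tool-calling request body as accepted by
# OpenRouter's chat completions endpoint. The get_weather tool is a
# hypothetical example, not one of FC-Eval's actual test cases.
def build_request(model: str, user_message: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "Get the current weather for a city",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
    }

payload = build_request("qwen/qwen3.5-9b", "What's the weather in Paris?")
```

The model's reply then carries a `tool_calls` entry that the evaluator can compare against the expected call.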
1. Install Ollama (see the Ollama Setup section)

2. Create the optimized model:

   ollama create qwen3.5:9b-fc -f qwen3.5-9b-fc.modelfile

3. Run the evaluation:

   fc-eval --provider ollama --models qwen3.5:9b-fc
Ollama provides a simple installation script for Linux/macOS:
# Install Ollama (official one-liner)
curl -fsSL https://ollama.com/install.sh | sh

This will:
- Download and install the Ollama binary
- Set up the Ollama service
- Start the Ollama server automatically
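In addition to the curl checks, the server can be probed programmatically. The sketch below (not part of FC-Eval) hits the standard `/api/tags` endpoint and parses its `models` list; the sample payload at the end lets it run offline.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"

def list_local_models(tags_payload: dict) -> list[str]:
    """Extract model names from an Ollama /api/tags response payload."""
    return [m["name"] for m in tags_payload.get("models", [])]

def server_is_up(url: str = OLLAMA_URL) -> bool:
    """Return True if the Ollama server answers on /api/tags."""
    try:
        with urllib.request.urlopen(f"{url}/api/tags", timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

# Offline demonstration with a sample /api/tags payload:
sample = {"models": [{"name": "qwen3.5:9b-fc"}, {"name": "llama3.2:latest"}]}
print(list_local_models(sample))  # → ['qwen3.5:9b-fc', 'llama3.2:latest']
```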
# Check Ollama is installed
ollama --version
# Verify server is running
curl http://localhost:11434/api/tags

The project includes an optimized Modelfile (qwen3.5-9b-fc.modelfile) that addresses the temperature and system prompt issues identified in our analysis:
FROM qwen3.5:9b
# System prompt optimized for function calling
SYSTEM You are a helpful AI assistant with access to tools/functions. When you need to perform an action, use the available tools by making function calls. Always respond with the correct function call format when a tool is needed.
# Critical parameters for function calling accuracy
PARAMETER temperature 0.0
PARAMETER top_p 0.9
PARAMETER top_k 10
PARAMETER num_ctx 8192
PARAMETER num_predict 4096

Key Configuration Changes:
| Parameter | Default | Optimized | Impact |
|---|---|---|---|
| `temperature` | 1.0 | 0.0 | Eliminates randomness for deterministic function calls |
| `top_p` | 0.95 | 0.9 | Slightly more focused sampling |
| `top_k` | 20 | 10 | Reduces token selection variety |
| `num_ctx` | 2048 | 8192 | Larger context window |
| `num_predict` | -1 | 4096 | Maximum response length |
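The same parameters can also be supplied per request instead of baking them into a Modelfile: Ollama's `/api/chat` (and `/api/generate`) endpoints accept an `options` object that overrides the model's defaults for that call. A minimal sketch of such a request body:

```python
# Sketch: pass the Modelfile parameters per request via the "options"
# field of Ollama's /api/chat endpoint instead of a custom Modelfile.
def chat_payload(model: str, prompt: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {
            "temperature": 0.0,   # deterministic function calls
            "top_p": 0.9,         # slightly more focused sampling
            "top_k": 10,          # reduce token selection variety
            "num_ctx": 8192,      # larger context window
            "num_predict": 4096,  # cap on response length
        },
    }

payload = chat_payload("qwen3.5:9b", "Book a table for two at 7pm.")
```

This is convenient for experiments; the Modelfile approach is better when you want every client of the model to get the same defaults.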
# Create the custom model from the Modelfile
ollama create qwen3.5:9b-fc -f qwen3.5-9b-fc.modelfile
# Verify the model was created
ollama list
# Inspect model parameters
ollama show qwen3.5:9b-fc

If you don't have the base model:
# Pull the base Qwen 3.5 9B model
ollama pull qwen3.5:9b
# Then create the custom version
ollama create qwen3.5:9b-fc -f qwen3.5-9b-fc.modelfile

FC-Eval requires an OpenRouter API key for cloud-based evaluation.
Option 1: Environment Variable (Recommended)
export OPENROUTER_API_KEY="your-api-key-here"

Add this to your ~/.bashrc or ~/.zshrc for persistence.
Option 2: Command Line Argument
fc-eval --provider openrouter --api-key "your-api-key-here"

Option 3: .env File
Create a .env file in your working directory:
OPENROUTER_API_KEY=your-api-key-here
Get your API key at: https://openrouter.ai/keys
No API key required for Ollama. Ensure the server is running:
# Check if Ollama is running
curl http://localhost:11434/api/tags

# Evaluate default models via OpenRouter
fc-eval --provider openrouter
# Evaluate specific models
fc-eval --provider openrouter --models qwen/qwen3.5-9b qwen/qwen3.5-27b
# Run with parallel execution
fc-eval --provider openrouter --mode parallel --max-workers 10

# Evaluate local Ollama models
fc-eval --provider ollama
# Evaluate specific local model
fc-eval --provider ollama --models qwen3.5:9b-fc
# Run with sequential mode (recommended for local testing)
fc-eval --provider ollama --mode sequential

Parallel Execution (recommended for cloud):
fc-eval --provider openrouter --mode parallel --max-workers 10

Sequential Execution (recommended for local/debugging):
fc-eval --provider ollama --mode sequential

Evaluate specific models:
# OpenRouter models
fc-eval --provider openrouter --models openai/gpt-4o anthropic/claude-3.5-sonnet
# Ollama models
fc-eval --provider ollama --models llama3.2 mistral

Run multiple trials per test for reliability metrics (default: 3):
fc-eval --provider openrouter --trials 5

A test passes if at least one trial succeeds (Best of N logic). Reliability is reported as the percentage of trials that passed.
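The Best of N scoring described above amounts to a couple of lines. A minimal sketch (illustrative, not FC-Eval's exact implementation):

```python
# Best of N scoring: a test passes if any trial succeeds; reliability is
# the percentage of trials that passed.
def score_test(trials: list[bool]) -> tuple[bool, float]:
    passed = any(trials)
    reliability = 100.0 * sum(trials) / len(trials) if trials else 0.0
    return passed, reliability

passed, reliability = score_test([True, False, True])
print(passed, round(reliability, 1))  # → True 66.7
```

So with 3 trials, 2 passes counts as a pass overall but is reported at 66.7% reliability, matching the example in the methodology section.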
Run only specific test categories:
# Single-turn tests only
fc-eval --provider openrouter --category single_turn
# Multi-turn tests only
fc-eval --provider openrouter --category multi_turn
# Agentic tests only
fc-eval --provider openrouter --category agentic

Save reports to a custom directory:
fc-eval --provider openrouter --output-dir ./my_results

- Dual Backend Support: Test models via OpenRouter (cloud) or Ollama (local)
- 30 Unique Test Cases: Comprehensive coverage across single-turn, multi-turn, and agentic scenarios
- Best of N Trials: Configurable trial count with reliability metrics
- Parallel Execution: Multi-threaded evaluation for faster results
- Comprehensive Reporting: JSON and TXT reports with detailed metrics
- AST-Based Validation: Accurate function call matching using abstract syntax trees
- Category Breakdown: Detailed analysis by test category and subcategory
- Latency Tracking: Performance metrics for each model
- Single-Turn (16 tests)
  - Simple function calls
  - Multiple function selection
  - Parallel function calling
  - Parallel multiple functions
  - Relevance detection
- Multi-Turn (8 tests)
  - Base multi-turn conversations
  - Missing parameter handling
  - Missing function scenarios
  - Long context management
- Agentic (6 tests)
  - Web search simulation
  - Memory/state management
  - Format sensitivity
- Best of N: A test passes if at least one of N trials succeeds
- Reliability: Percentage of trials that passed (e.g., 2/3 trials = 66.7% reliability)
- AST Matching: Function calls validated using abstract syntax tree comparison
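The AST matching idea can be sketched with Python's standard `ast` module: parse each call expression and compare the function name and arguments structurally, so formatting differences (whitespace, argument order for keywords) don't cause false mismatches. This is an illustrative sketch, not FC-Eval's exact implementation.

```python
import ast

# Illustrative AST-based call matching: normalize each call expression
# into (function, positional args, sorted keyword args) and compare.
def calls_match(expected: str, actual: str) -> bool:
    def normalize(src: str):
        call = ast.parse(src, mode="eval").body
        if not isinstance(call, ast.Call):
            raise ValueError("not a function call")
        name = ast.dump(call.func)
        args = [ast.dump(a) for a in call.args]
        kwargs = sorted((kw.arg, ast.dump(kw.value)) for kw in call.keywords)
        return (name, args, kwargs)
    return normalize(expected) == normalize(actual)

print(calls_match("get_weather(city='Paris')",
                  "get_weather( city = 'Paris' )"))  # → True
```

String comparison would reject the second call above because of spacing; structural comparison accepts it while still rejecting any change to the function name or argument values.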
Problem: Connection refused error when using Ollama provider
Solution:
# Check if Ollama is running
curl http://localhost:11434/api/tags
# If not running, start the server
ollama serve

Problem: model not found error
Solution:
# List available models
ollama list
# Pull the required model
ollama pull qwen3.5:9b
# Create custom model with Modelfile
ollama create qwen3.5:9b-fc -f qwen3.5-9b-fc.modelfile

Problem: 401 Unauthorized or 429 Rate Limited
Solution:
# Verify API key is set
echo $OPENROUTER_API_KEY
# Set API key
export OPENROUTER_API_KEY="your-key-here"
# For rate limits, use sequential mode with fewer workers
fc-eval --provider openrouter --mode sequential --trials 1

Problem: Local Ollama models show significantly lower accuracy than OpenRouter
Explanation: This is expected due to:
- Quantization: Ollama uses Q4_K_M (4-bit) quantization by default
- System Prompts: OpenRouter may apply additional optimizations
- API Optimizations: Cloud providers may use response format enforcement
Recommendation: Use the custom Modelfile (qwen3.5-9b-fc.modelfile) for the best local results, but still expect a gap of roughly 60 percentage points versus OpenRouter.
Based on our analysis with Qwen 3.5 9B:
| Metric | OpenRouter (Cloud) | Ollama (Local) | Difference |
|---|---|---|---|
| Accuracy | 83.3% | 22.2% | -61.1 pp |
| Temperature | 0.0 (default) | 1.0 (default) | Critical |
| Avg Latency | ~1600ms | ~8900ms | 5.5x slower |
| Quantization | Unknown (likely F16) | Q4_K_M (4-bit) | Precision loss |
Recommendation: Use OpenRouter for production function-calling tasks requiring high accuracy. Use Ollama for local development, privacy-sensitive applications, or offline scenarios with acceptable accuracy trade-offs.
| File | Description |
|---|---|
| `evaluate_fc.py` | Main evaluation script |
| `qwen3.5-9b-fc.modelfile` | Optimized Ollama Modelfile for function calling |
| `FUNCTION_CALLING_ACCURACY_ANALYSIS.md` | Detailed discrepancy analysis report |
| `results/` | Directory containing evaluation reports |
MIT License - see LICENSE file for details.
Built with ❤️ by NEO
NEO - A fully autonomous AI Engineer