# BinAI: The $100 AI Scientist That Never Sleeps

## Inspiration

The idea sparked from a simple question: "Can AI truly understand spatial relationships, or is it just good at predicting text?" While LLMs like Claude can describe spatial concepts beautifully, we wanted to test their understanding against actual physics. Most spatial reasoning benchmarks use static text descriptions, but the real world is dynamic, three-dimensional, and governed by the laws of physics. We envisioned an AI scientist that could run experiments 24/7, analyze results, research related work, and write papers autonomously, democratizing research for anyone with $100 in cloud credits. The name "BinAI" comes from the Kabbalistic concept of Binah (understanding/intelligence) + AI, representing the bridge between wisdom and analytical understanding.

## What it does

BinAI is a fully automated research pipeline that:

- Runs spatial reasoning experiments using PyBullet physics simulation (60 experiments/day)
- Analyzes results with Claude AI in real time
- Researches related work using Perplexity's search API
- Stores structured data in BigQuery for trend analysis
- Generates a research paper automatically every week
- Operates 24/7 with zero human intervention
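The success of each simulated task can be scored with a simple geometric check. A minimal sketch, assuming success means the moved object ends up within a distance threshold of the target — the threshold value and function names are illustrative, not taken from the project:

```python
import math

def task_success(moved_pos, target_pos, threshold=0.15):
    """Return True if the moved object ended up within `threshold`
    (in simulation units) of the target object.

    This is an assumed success criterion for tasks like
    "move the red cube near the blue sphere"."""
    return math.dist(moved_pos, target_pos) <= threshold

# Red cube moved to (0.5, 0.1, 0.05); blue sphere sits at (0.45, 0.1, 0.05).
print(task_success((0.5, 0.1, 0.05), (0.45, 0.1, 0.05)))  # → True
```

Reading the final object poses out of the physics engine and reducing them to a boolean like this is what makes a 60-experiments/day pipeline easy to aggregate in BigQuery.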

The system tests LLMs on dynamic spatial tasks such as "move the red cube near the blue sphere" and measures success rates, spatial understanding patterns, and reasoning quality. It has achieved 100% task completion across all trials, generating rich datasets for spatial AI research.

## How we built it

**Architecture:** Cloud-native Python pipeline on Google Cloud Platform

**Core Components:**

- **Spatial Environment:** PyBullet physics engine wrapped in FastAPI
- **LLM Integration:** Direct Anthropic API calls for spatial analysis
- **Research Integration:** Perplexity API for automated literature review
- **Data Pipeline:** BigQuery for experiment storage and trend analysis
- **Paper Generation:** Claude-powered automated research writing
- **Orchestration:** Cloud Scheduler + Cloud Run for 24/7 operation
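The data pipeline stores each experiment as a structured row. A sketch of what one record might look like before being streamed into BigQuery — all field names here are assumptions for illustration, since the actual schema isn't shown in this write-up:

```python
from datetime import datetime, timezone

def build_experiment_row(task, success, steps, model="claude"):
    """Assemble one experiment result as a flat dict, the shape you would
    stream into BigQuery (e.g. via the client's insert_rows_json).

    Field names are illustrative, not the project's actual schema."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "task_description": task,
        "model": model,
        "success": success,
        "simulation_steps": steps,
    }

row = build_experiment_row("move the red cube near the blue sphere", True, 240)
```

Keeping rows flat like this makes the downstream trend analysis a matter of plain SQL over the experiments table.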

**Development Journey:**

1. Started with a basic PyBullet spatial environment
2. Added Three.js visualization for real-time experiment monitoring
3. Integrated Claude for spatial reasoning analysis
4. Connected Perplexity for automated research synthesis
5. Built the production pipeline with BigQuery storage
6. Created the automated paper generation system
7. Deployed on Google Cloud with secure authentication

## Challenges we ran into

**Technical Hurdles:**

- **PyBullet Installation Nightmare:** The standard pip install failed on macOS ARM64 due to compilation errors. Solved by switching to Conda environment management.
- **WebSocket Handler Debugging:** Spent hours on TypeError and AttributeError issues with WebSocket handler signatures before finding the correct async handler pattern.
- **MCP Integration Complexity:** Initially pursued Model Context Protocol integration, but it added unnecessary complexity. Pivoted to direct API calls for reliability.
- **Configuration Loop Hell:** Nearly got trapped in endless MCP server configuration. Broke the cycle by prioritizing working code over perfect architecture.
- **Docker Security Issues:** Had to redesign secret management to avoid embedding API keys in containers.
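The WebSocket signature bug is a common trap: recent versions of the `websockets` library expect a single-argument `async def handler(websocket)`, while many older examples show a two-argument `handler(websocket, path)` style, and a mismatch surfaces only at connection time as a TypeError. A generic sketch of a signature check that would catch it before deployment — the helper itself is illustrative, not part of the project:

```python
import inspect

def is_valid_handler(fn):
    """Return True if fn matches the expected `async def handler(websocket)`
    shape: a coroutine function taking exactly one positional argument."""
    if not inspect.iscoroutinefunction(fn):
        return False  # a plain function fails when the server awaits it
    return len(inspect.signature(fn).parameters) == 1

async def echo_handler(websocket):
    """Correct modern signature: one parameter, async."""
    async for message in websocket:
        await websocket.send(message)

async def legacy_handler(websocket, path):
    """Older two-argument style that newer servers no longer call."""
```

A startup-time assertion like `assert is_valid_handler(echo_handler)` turns an obscure runtime error into an immediate, readable failure.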

**Strategic Challenges:**

- **Scope Creep:** Resisted the temptation to build every possible feature and focused on core research value.
- **Over-Engineering:** Almost built complex microservices when a simple FastAPI app was sufficient.
- **Budget Optimization:** Designed cost controls to operate efficiently within a $100/month budget.
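The $100/month target is easy to sanity-check with back-of-the-envelope arithmetic. A sketch with purely illustrative numbers — the per-experiment cost and fixed infrastructure figure are assumptions, not measured values from the project:

```python
def monthly_cost(experiments_per_day=60, cost_per_experiment_usd=0.05,
                 fixed_infra_usd=10.0, days=30):
    """Estimate monthly spend as per-experiment API cost plus a fixed
    infrastructure charge (Cloud Run, BigQuery). All figures illustrative."""
    return experiments_per_day * days * cost_per_experiment_usd + fixed_infra_usd

# With these assumed figures: 60 * 30 * $0.05 + $10 ≈ $100/month.
print(round(monthly_cost(), 2))
```

Framing cost this way makes the control knobs obvious: experiment cadence and per-call token usage dominate, so capping daily runs and prompt length is how the budget stays fixed.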

## Accomplishments that we're proud of

- 🎯 **100% Success Rate:** Achieved perfect task completion across all spatial reasoning experiments
- ⚡ **Real-Time Research Integration:** First system to combine live physics simulation, LLM analysis, and automated literature review
- 💰 **Cost Innovation:** The entire research lab operates for $100/month using smart API optimization
- 📊 **Production Scale:** Generates 420 experiments and one research paper per week autonomously
- 🔒 **Enterprise Security:** Implemented proper secret management and authenticated Cloud Run deployment
- 📈 **Research Quality:** Statistically significant results with rich multi-modal datasets suitable for academic publication
- 🤖 **Novel Methodology:** Created a new evaluation paradigm for spatial AI that bridges simulation and language

## What we learned

**Technical Insights:**

- **Simplicity Wins:** Direct API calls often outperform complex abstraction layers.
- **Conda > Pip:** For scientific computing with binary dependencies, Conda is more reliable.
- **Security First:** Proper secret management prevents major headaches in production.
- **Measure Everything:** BigQuery analytics revealed patterns invisible in manual testing.
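"Direct API calls" here means skipping the abstraction layer and assembling the request yourself. A sketch of building a Messages-style request body as plain data — the model name is a placeholder, and in practice this payload would be sent through the Anthropic SDK or a plain HTTP POST:

```python
def build_analysis_request(prompt, model="claude-model-placeholder",
                           max_tokens=1024):
    """Assemble the body of a single chat-style API call as a plain dict.

    Keeping the request as plain data (instead of hiding it behind a
    framework) makes it trivial to log, retry, and unit-test."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_analysis_request("Did the red cube end up near the blue sphere?")
```

The point of the lesson is exactly this visibility: when the request is a dict you constructed, there is no hidden configuration layer to get stuck in.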

**Research Discoveries:**

- **LLM Spatial Capability:** Claude shows consistent spatial reasoning in dynamic environments.
- **Cross-Modal Integration:** Combining physics simulation with language analysis yields novel insights.
- **Automated Research Value:** AI-generated literature synthesis accelerates research cycles.
- **Cost-Effective Scale:** Cloud automation makes continuous research accessible to individual researchers.

**Project Management:**

- **Anti-Loop Strategy:** Focusing on working software over perfect architecture prevents configuration hell.
- **Iterative Building:** Start with an MVP, then add features incrementally.
- **Real Testing:** Production deployment reveals issues invisible in local development.

## What's next for BinAI

**Short-term (1-3 months):**

- **Task Diversity:** Add rotation, stacking, and multi-object spatial challenges
- **Multi-LLM Comparison:** Test GPT-4, Gemini, and specialized models side by side
- **Research Publication:** Submit a methodology paper to NeurIPS/ICML 2025
- **Open Source Release:** Share the framework for community research

**Medium-term (3-6 months):**

- **Visual Integration:** Add computer vision for object recognition challenges
- **Robotic Interface:** Connect to real robot arms for embodied spatial testing
- **Advanced Analytics:** Develop failure-mode analysis and improvement recommendations
- **Industry Partnerships:** Collaborate with robotics companies on spatial AI evaluation

**Long-term Vision:**

- **Research Platform:** Scale to support multiple research institutions
- **Curriculum Learning:** Adaptive difficulty progression based on AI capability
- **Scientific Discovery:** Enable automated hypothesis generation and testing
- **AGI Evaluation:** A benchmark for measuring spatial intelligence in artificial general intelligence
