Large language models show growing promise on research-level physics reasoning tasks, but it remains unclear how researcher-agent interactions affect outcomes. We introduce SCALAR, an Actor-Critic-Judge pipeline for quantum field theory and string theory problems, in which an Actor proposes solutions, a Critic gives iterative feedback, and an independent Judge evaluates the result against reference solutions. Across model families, multi-turn dialogue outperforms single-shot attempts, while the value of prompting choices and critique styles depends strongly on the Actor-Critic pairing. SCALAR provides a controlled testbed for identifying interaction structures that help or hinder AI-driven scientific discovery.
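
The Actor-Critic-Judge loop described above can be illustrated with a minimal sketch. It is not SCALAR's implementation: every name here (`run_dialogue`, `Transcript`, the toy roles) is hypothetical, the three roles are stand-in callables rather than language models, and two design points are assumptions rather than details from the text, namely that the Critic never sees the reference solution and that the Judge scores only the final Actor attempt.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical role signatures; in practice each would wrap a model call.
Actor = Callable[[str, List[str]], str]  # (problem, feedback so far) -> solution
Critic = Callable[[str, str], str]       # (problem, solution) -> feedback
Judge = Callable[[str, str], bool]       # (solution, reference) -> verdict


@dataclass
class Transcript:
    solutions: List[str] = field(default_factory=list)
    feedback: List[str] = field(default_factory=list)
    verdict: bool = False


def run_dialogue(problem: str, reference: str, actor: Actor, critic: Critic,
                 judge: Judge, max_turns: int = 3) -> Transcript:
    """Run up to max_turns Actor attempts; max_turns=1 is the single-shot case."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    transcript = Transcript()
    for turn in range(max_turns):
        solution = actor(problem, transcript.feedback)
        transcript.solutions.append(solution)
        # Assumption: the Critic sees the problem and attempt, not the reference.
        if turn < max_turns - 1:
            transcript.feedback.append(critic(problem, solution))
    # Assumption: the Judge scores only the final attempt against the reference.
    transcript.verdict = judge(transcript.solutions[-1], reference)
    return transcript


# Toy stand-ins so the sketch runs end to end.
def toy_actor(problem: str, feedback: List[str]) -> str:
    return "42" if feedback else "41"  # revises only after receiving feedback


def toy_critic(problem: str, solution: str) -> str:
    return "Recheck the final arithmetic step."


def toy_judge(solution: str, reference: str) -> bool:
    return solution.strip() == reference.strip()


single_shot = run_dialogue("toy problem", "42", toy_actor, toy_critic, toy_judge, max_turns=1)
multi_turn = run_dialogue("toy problem", "42", toy_actor, toy_critic, toy_judge, max_turns=3)
```

With these toy roles the single-shot run fails and the multi-turn run succeeds, but only because the stand-in Actor is built to revise after feedback. The sketch shows the control flow, not an empirical result.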