I have a confession: I hate "agents."
For the last year, the AI industry has promised us autonomous helpers that can run our lives. In reality, most agents are just chatbots that get stuck in infinite loops trying to check the weather. They hallucinate API calls, forget what they were doing halfway through, or just give up and ask you to do it yourself.
Gemini 3.1 Pro claims to fix this with "improved tool calling and multi-step planning." The benchmarks look great, but benchmarks don't answer emails. So I decided to test it on the most boring, messy, human task I could think of: scheduling a meeting based on a PDF attachment.
The "Boring Chore" Test
Here is the workflow I wanted the agent to handle:
- Read a PDF invoice attached to an email.
- Extract the due date and vendor name.
- Check my calendar for that date to see if I'm free to review it.
- Draft a reply to the vendor confirming receipt and scheduling a review time.
This sounds simple, but it trips up most models. They'll extract the date but fail the calendar check. Or they'll check the calendar but forget the vendor's name.
How Gemini 3.1 Pro Handled It
I set this up using a simple Python script wrapping the Gemini API with access to a few mock "tools" (a calendar look-up tool and an email draft tool).
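For context, here's a minimal sketch of what those mock tools looked like. The calendar data, the `draft_email` signature, and the `MOCK_CALENDAR` name are illustrative assumptions, not my exact harness; only `get_calendar_events` matches the call the model actually made.

```python
# Hypothetical in-memory calendar: ISO date string -> list of events.
MOCK_CALENDAR = {
    "2026-03-14": [{"title": "OOO", "all_day": True}],
    "2026-03-13": [{"title": "Standup", "start": "09:00", "end": "09:30"}],
}

def get_calendar_events(date: str) -> list[dict]:
    """Return the events on a given date (empty list means free)."""
    return MOCK_CALENDAR.get(date, [])

def draft_email(to: str, subject: str, body: str) -> dict:
    """Stub email tool: records the draft instead of sending anything."""
    return {"to": to, "subject": subject, "body": body, "status": "draft"}
```

The point of stubs like these is that the model never touches a real calendar or inbox; every "action" is just a dictionary you can inspect afterward.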
Step 1: The Extraction
The model nailed the extraction. This wasn't surprising—Gemini's long context window (128k+) is great at reading documents. It found the date (March 14th) instantly.
Step 2: The Logic Check
This is where it usually gets messy. I didn't tell it how to check my availability. I just said, "Check if I'm free."
Gemini 3.1 Pro correctly called the get_calendar_events(date="2026-03-14") tool.
- The Surprise: It noticed I had an "all-day" blocker marked "OOO" (Out of Office).
- The Reasoning: Instead of just booking the meeting, it paused. It reasoned, "The user is out of office on the due date."
Step 3: The Adjustment
Most agents would stop here or error out. Gemini 3.1 Pro called the tool again for the day before (March 13th). It found a slot at 2 PM.
Then, it drafted the email:
"Hi [Vendor Name], I've received the invoice. Since I'm out of office on the 14th, I'll review and process this on the 13th around 2 PM."
I didn't tell it to check the day before. It just... figured that was the logical next step.
Why "Tool Use" Matters More Than IQ
We talk a lot about "reasoning" in terms of solving math problems. But for agents, reasoning means resilience.
When an agent hits a wall (like an OOO calendar block), does it crash, or does it look for a door? Gemini 3.1 Pro seems to have better "error recovery" than its predecessors. It treats a failed tool call not as a failure, but as new information.
The Agentic "Uncanny Valley"
It wasn't perfect.
- Verbosity: It narrated its every move. "I am now checking the calendar. I see you are busy. I will now check the previous day." Useful for debugging, annoying for a user.
- Confidence: It was almost too confident. It didn't ask if moving the review to the 13th was okay; it just assumed it was. For an invoice, that's fine. For sending a contract? That's scary.
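The fix for the over-confidence problem is old-fashioned: gate any action the model merely *assumed* behind a human approval callback. A toy sketch of that pattern (all names here are hypothetical, not part of any Gemini API):

```python
from typing import Callable

def require_approval(action: str, details: str,
                     approve: Callable[[str], bool]) -> bool:
    """Ask the human before committing an action the model assumed was OK."""
    return approve(f"About to {action}: {details}. Proceed?")

# Usage with an auto-deny stub standing in for a real prompt:
ok = require_approval("reschedule review", "move to Mar 13, 2 PM",
                      lambda question: False)
# ok is False, so the agent stops and asks instead of assuming.
```

For invoices you might auto-approve; for contracts, you'd wire `approve` to an actual prompt. The design choice is that the gate lives in your harness, not in the model's goodwill.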
Conclusion
Gemini 3.1 Pro isn't the Jarvis we were promised, but it might be the intern we actually need. It handles the "boring chores" with a level of competence that finally feels useful rather than experimental.
It still requires supervision—I wouldn't let it access my bank account yet—but for the first time, I felt like I was working with an agent, not debugging one.
[SOURCE NEEDED] for specific "multi-step planning" benchmark scores, though anecdotal performance on the "Boring Chore" test suggests significant improvement.