Machine Learning Pills: Weekly Dose

Weekly Dose #2 - The AI Race Moved From Models to Deployment

David Andrés — Fri, 15 May 2026 06:31:18 GMT

📰 The Weekly Dose

Welcome back to the Weekly Dose: your 5-minute breakdown of the AI/ML news that changed how we builders should think this week.

This second edition covers 7 May to 14 May 2026. No stale benchmark victory laps. No “agents will change everything” filler. Just the five stories that affect how you build, deploy, secure, or buy AI systems.

This week: OpenAI turned enterprise deployment into a $4B+ services machine, Anthropic proved last week’s finance-agent push was the start of a vertical packaging machine, Codex and Claude Code turned pricing pages into competitive weapons, Google reframed Android as an agentic execution layer, and OpenAI Daybreak pushed AI security from “model capability” into workflow product.

Interested in sponsoring this section (or any others)? Contact me here:

1. OpenAI is absorbing the systems integrator layer

OpenAI launched the OpenAI Deployment Company on May 11, a new majority-controlled enterprise deployment arm designed to help companies put OpenAI systems into real operations. The company starts with an initial $4B investment, backed by 19 financial and consulting partners including BBVA, TPG, Advent, Bain Capital, Brookfield, Goldman Sachs, McKinsey, and Capgemini. OpenAI is also acquiring Tomoro, an AI engineering and consulting firm with around 150 specialists, to give the new company implementation capacity from day one.

This is not another “enterprise AI platform” announcement. It is OpenAI moving directly into the deployment work that usually sits with systems integrators, consultancies, and internal transformation teams: workflow mapping, data access, security review, implementation, governance, and actually getting the thing used.

🫵 Why it matters to you: If your AI roadmap depends on months of custom glue code, internal enablement, stakeholder wrangling, and integration work, your vendor landscape just changed. OpenAI is not only selling the model anymore. It is selling the implementation path.

🤫 The subtext nobody says out loud: The frontier labs have discovered the least glamorous truth in enterprise software: distribution without deployment is just shelfware. The next fight is not only who has the smartest model. It is who can walk into a bank, insurer, retailer, or telco and make the model survive procurement, security, compliance, and daily workflow reality.

2. Anthropic turned last week’s finance-agent playbook into a vertical packaging machine

Last week, Anthropic packaged finance workflows. This week, it proved that wasn’t a one-off.

On May 12, Anthropic expanded Claude for legal work, adding Claude Cowork tools and integrations for legal research, contracts, documents, case law, and practice-specific workflows. Reported integrations include tools such as CourtListener, Thomson Reuters Westlaw, Box, Harvey, and others, plus pre-built legal skills for areas such as employment, privacy, product law, legal clinics, and legal education.

Then on May 13, Anthropic launched Claude for Small Business, connecting Claude to tools like QuickBooks, PayPal, HubSpot, Canva, DocuSign, Google Workspace, and Microsoft 365. The package runs through Claude Cowork and includes built-in workflows across finance, sales, HR, marketing, operations, and customer service.

Evan Santos (@evansantoslaw), 13 May 2026: “@AnthropicAI has launched Claude for Small Business, with 15 ready-to-run workflows spanning QuickBooks, PayPal, HubSpot, Canva, DocuSign, Google Workspace, and Microsoft 365. That could help level the playing field for smaller teams, especially with free AI fluency training”

The move is clear: Anthropic is turning Claude from a general-purpose assistant into a set of pre-wired business surfaces. Finance last week. Legal and small business this week. More verticals will follow.

🫵 Why it matters to you: If you’re building internal AI tools for legal, accounting, HR, sales ops, compliance, or SMB workflows, “we wrapped a model around our docs” is no longer enough. The new baseline is model + connectors + permissions + approval gates + workflow templates.

🤫 The subtext nobody says out loud: Anthropic is not just selling Claude. It is selling migration pain in reverse. Once Claude is wired into a team’s documents, legal research, CRM, invoicing, email, and approval flows, switching vendors becomes an integration problem, not a model-quality debate.

3. Codex and Claude Code made pricing pages the new battleground

On May 14, OpenAI and Anthropic moved almost simultaneously on AI coding tools. OpenAI offered companies two months of free Codex usage if they sign up within 30 days.

Less than an hour later, Anthropic increased Claude Code weekly usage limits by 50% for Pro, Max, Team, and Enterprise users until July 13.

Axios also reported that Anthropic is putting some outside agent-tool usage behind a separate credit meter, highlighting a broader shift away from simple “all-you-can-eat” AI subscriptions as agents consume far more compute than normal chat usage.

This is not a feature war anymore. It is a retention war. Free usage windows, temporary limit boosts, separate credits, agent meters, and plan-specific caps are becoming the actual product surface developers feel every day.

🫵 Why it matters to you: If you use Codex, Claude Code, Cursor, Devin, or similar tools for real engineering work, you should re-benchmark under the current limits. Measure completion rate, retry rate, wall-clock time, approval interruptions, and cost per finished task, not just “which one felt smarter in a demo.”

🤫 The subtext nobody says out loud: The coding-agent fight is moving from capability demos to usage economics. Developers do not abandon tools only because the model makes mistakes. They abandon them when the agent runs out of runway halfway through a migration, refactor, or bug hunt.
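If you want to make that re-benchmark concrete, here is a minimal sketch of the bookkeeping. It assumes you log one record per task attempt; the `TaskRun` schema and its field names are hypothetical, not something any of these tools export, and "retry rate" is taken here to mean retries per task.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One attempt to finish a task with a coding agent (hypothetical log schema)."""
    task_id: str
    completed: bool       # did the agent finish the task end to end?
    retries: int          # how many times the task had to be re-prompted
    wall_clock_s: float   # elapsed time, including time spent waiting on limits
    approvals: int        # human approval interruptions
    cost_usd: float       # spend attributed to this run


def summarize(runs: list[TaskRun]) -> dict[str, float]:
    """Aggregate the metrics named above for one tool under one plan."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    finished = sum(r.completed for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "completion_rate": finished / n,
        "retries_per_task": sum(r.retries for r in runs) / n,
        "mean_wall_clock_s": sum(r.wall_clock_s for r in runs) / n,
        "approvals_per_task": sum(r.approvals for r in runs) / n,
        # Failed runs still cost money, so divide total spend by finished tasks.
        "cost_per_finished_task": total_cost / finished if finished else float("inf"),
    }
```

Run the same task set through each tool under its current limits and compare the summaries side by side. Dividing total spend by finished tasks is a deliberate choice: it penalizes a tool that burns budget on runs it never completes.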

4. Google turned Android into an agentic execution layer

Google used its Android Show: I/O Edition on May 12 to preview a major Android + Gemini Intelligence push. The upgrades include Gemini Intelligence across Android phones, Chrome auto-browse, smarter Autofill, AI-generated widgets, Gboard’s Rambler dictation cleanup, and deeper Android Auto features.

Android (@Android), 12 May 2026: “#TheAndroidShow was packed with new updates to make your everyday Android experience even better. Here are just a few… 👇🧵”

The important shift is that Gemini is being positioned less as a chatbot and more as an action layer. Google described examples like interacting directly with apps, turning lists into shopping baskets, ordering food, filling complex forms, and queuing actions for final user confirmation. Chrome auto-browse is expected to bring similar automation to websites from late June.

Android Auto is also getting a broader AI and interface upgrade, including deeper Gemini integration, richer widgets, full-screen support for unusual car displays, and more immersive Google Maps navigation.

🫵 Why it matters to you: If you build consumer apps, Android products, mobile commerce flows, or services that rely on users manually tapping through screens, the interface contract is changing. Your app needs clear intents, clean state, sensible permissions, and confirmation flows that an assistant can operate safely.

🤫 The subtext nobody says out loud: Google does not need Gemini to win every benchmark if Gemini is already where the user acts. The model embedded in the phone, browser, keyboard, car, and laptop has a distribution advantage that a smarter model in a separate tab has to fight uphill to beat.

5. OpenAI Daybreak made AI security a workflow product

OpenAI launched Daybreak on May 11 US time as a cybersecurity initiative focused on finding and fixing software vulnerabilities before attackers exploit them. Daybreak uses OpenAI models, Codex, Codex Security, and security partners to create threat models from code, identify likely attack paths, validate vulnerabilities, and automate detection of higher-risk issues.

This is OpenAI’s clearest answer to Anthropic’s Claude Mythos / Project Glasswing push. The Verge reports that Daybreak involves specialized cyber models including GPT-5.5 with Trusted Access for Cyber and GPT-5.5-Cyber, which began rolling out to vetted cyber defenders.

The key difference: Daybreak is not just “a cyber model.” It is a workflow wrapper around the model: repo context, threat modeling, vulnerability validation, detection, and eventually patching. That makes it much more interesting for AppSec teams than another leaderboard screenshot.

🫵 Why it matters to you: If you work in AppSec, detection engineering, platform engineering, or AI governance, you should start planning for AI systems that can inspect codebases, reason across attack paths, and propose mitigations. The blocker will not only be model quality. It will be access control, logging, sandboxing, patch review, and trust.

🤫 The subtext nobody says out loud: Security vendors are about to be judged on whether they can close the loop. Finding issues is table stakes. The new question is: can the system understand exploitability, prioritize the real risk, generate a fix, validate it, and leave a clean audit trail?

🛠️ Practical takeaways:

Define which repos an AI security agent should be allowed to inspect.
Decide what the agent can do automatically versus what requires human approval.
Require logs for code access, generated findings, attempted reproduction, and patch suggestions.
Keep patch generation separate from patch merge until your review process is mature.
Start building evals for vulnerability triage quality, not just code-generation quality.
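One way to turn the first four takeaways into something enforceable is a small policy gate in front of the agent. This is a sketch under assumptions: the `AgentPolicy` class, the action names, and the repo names are all hypothetical and do not reflect how Daybreak or any vendor tool models permissions.

```python
from dataclasses import dataclass, field


@dataclass
class AgentPolicy:
    """Hypothetical access policy for an AI security agent."""
    allowed_repos: set[str]
    auto_actions: set[str]        # allowed without a human in the loop
    approval_actions: set[str]    # allowed only after human sign-off
    audit_log: list[dict] = field(default_factory=list)

    def decide(self, repo: str, action: str) -> str:
        """Return 'allow', 'needs_approval', or 'deny', and log every decision."""
        if repo not in self.allowed_repos:
            decision = "deny"
        elif action in self.auto_actions:
            decision = "allow"
        elif action in self.approval_actions:
            decision = "needs_approval"
        else:
            # Anything not explicitly listed is denied by default.
            decision = "deny"
        self.audit_log.append({"repo": repo, "action": action, "decision": decision})
        return decision


# Example: patch suggestions need review, and merging is not listed at all.
policy = AgentPolicy(
    allowed_repos={"payments-api"},
    auto_actions={"read_code", "report_finding"},
    approval_actions={"reproduce_issue", "suggest_patch"},
)
```

Leaving `merge_patch` out of both action sets keeps patch generation separate from patch merge, and the deny-by-default branch means new agent capabilities stay blocked until someone adds them on purpose.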

So, what does all of this mean in practice? Here’s our take, and your to-do list for the week ahead.

💡 Our take

This week’s theme is simple: AI is moving from capability into distribution, packaging, and deployment.

OpenAI is building the deployment muscle. Anthropic is packaging vertical workflows. Google is embedding Gemini where users already act. Codex and Claude Code are fighting over limits, credits, and retention. Daybreak turns cyber capability into an operational workflow.

The model is still important. But the model is no longer the product by itself.

The key signals from this week:

Deployment is becoming part of the AI product. The OpenAI Deployment Company is a bet that enterprise AI needs implementation capacity, not just APIs.
Vertical workflow packaging is accelerating. Anthropic moved from finance to legal and small business in back-to-back weeks.
Coding-agent competition is now commercial infrastructure. Limits, credits, promos, and metering shape whether developers can actually use the tools.
Distribution is becoming agentic. Google is turning Android into a surface where Gemini can act, not just answer.
AI security is becoming a managed workflow. Daybreak points to security agents that reason across repos, attack paths, detection, and fixes.

The better question is no longer “which model is best?”

It is: which AI system owns the path from intent to completed work, with the integrations, controls, and deployment muscle to make it real?

📌 Your to-do list

Map where your AI projects still depend on manual deployment work. If the hard part is integration, governance, training, or adoption, compare your roadmap against vendor-led deployment options.
Review your vertical workflows against packaged AI offerings. Legal, finance, SMB ops, HR, sales, and accounting are becoming vendor-shaped categories fast.
Re-benchmark your coding-agent setup. Track task completion, retries, approvals, wall-clock time, and effective cost under current Codex and Claude Code limits.
Audit pricing exposure for agentic workloads. Separate human chat usage from autonomous agent usage in your budget model. They do not scale the same way.
Make your Android surfaces assistant-readable. Review intents, permissions, state handling, checkout flows, and confirmation steps before Gemini-style automation becomes the default UX.
Write an AI security-agent access policy. Define repo access, sandboxing, logging, patch authority, human review, and incident response before tools like Daybreak become procurement conversations.
Build security evals for triage, not just code generation. Measure whether AI security tools correctly prioritize exploitability, blast radius, false positives, and patch quality.
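For the triage evals in the last item, a reasonable starting point is plain precision and recall on the agent's "exploitable" flags against findings your own security team has labelled. A minimal sketch, assuming each finding has a stable ID; the function name and the ID format are hypothetical.

```python
def triage_metrics(flagged: set[str], truly_exploitable: set[str]) -> dict[str, float]:
    """Compare an agent's 'exploitable' flags with human-labelled ground truth.

    Precision and recall are reported as 0.0 when they are undefined
    (nothing flagged, or nothing truly exploitable in the sample).
    """
    true_positives = len(flagged & truly_exploitable)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truly_exploitable) if truly_exploitable else 0.0
    return {
        "precision": precision,                               # how much of what it flagged was real
        "recall": recall,                                     # how much of the real risk it caught
        "false_positives": float(len(flagged - truly_exploitable)),
    }
```

Low precision means analysts drown in noise; low recall means real risk slips through. Track both per severity tier rather than as one blended number.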

See you next week.

Weekly Dose #1 - AI’s Next Battlefield Isn’t Models. It’s Systems

David Andrés — Fri, 08 May 2026 06:02:18 GMT

📰 The Weekly Dose

Welcome to the Weekly Dose: your 5-minute breakdown of the AI/ML news that changed how we builders should think this week.

This first edition covers 30 April to 7 May 2026. No stale benchmark victory laps. No “this might be big someday” filler. Just the six stories that affect how you build, deploy, secure, or buy AI systems.

This week: OpenAI made voice agents a serious engineering surface, finance agents became enterprise products, ML supply-chain attacks escalated, OpenAI upgraded ChatGPT’s default model, Anthropic bought massive new compute capacity, and DeepSeek proved “cheap and capable” is becoming strategically dangerous.

Issue sponsored by tracebloc

Most of us have a dataset we can’t share, a complex problem, and someone outside the team who could probably help. There’s no good way to bridge the two. Access takes months, if it happens at all.

tracebloc is the tool for that. For making confidential data accessible to collaborators, universities, freelancers, startups, or consultants. For sharing complex problems, building and innovating together.

You set up your own ML workspace on your infra with one line of code. Invite people by email, they train and fine-tune models on your data inside containers. Data stays in your infra. You see a leaderboard with how each contributor performs on your problem.

It’s free. Live in minutes.

🔗 See how it works

1. OpenAI just made voice agents a serious engineering surface

On 7 May, OpenAI announced three new realtime audio models through its API. GPT‑Realtime‑2 is the first voice model with GPT‑5‑class reasoning, built to handle harder requests and carry conversations forward naturally. GPT‑Realtime‑Translate does live speech translation from 70+ input languages into 13 output languages while keeping pace with the speaker. GPT‑Realtime‑Whisper is a streaming transcription model that converts speech to text as the person is still talking.

The gap over the previous generation is measurable: GPT‑Realtime‑2 with high reasoning scored 96.6% on Big Bench Audio, compared to 81.4% for GPT‑Realtime‑1.5. On Audio MultiChallenge instruction following, the xhigh reasoning tier scored 48.5% versus 34.7% for the prior model. New developer-facing features include preambles (“let me check that” before a tool call), parallel tool calls mid-conversation, and a context window expanded from 32K to 128K tokens. On pricing, GPT‑Realtime‑2 costs $32/1M audio input tokens and $64/1M audio output tokens. GPT‑Realtime‑Translate runs at $0.034 per minute and GPT‑Realtime‑Whisper at $0.017 per minute.

🫵 Why it matters to you: If your product has a voice layer (customer support, accessibility, field agents, meeting transcription), the capability bar just moved. Real GPT-5-class reasoning running natively in a voice model, with parallel tool calls and 128K context, is a different product category than what shipped twelve months ago. The “voice is too unreliable for production” objection is getting harder to make.

🤫 The subtext nobody says out loud: Live translation across 70+ languages, priced per minute, is the end of “we’ll add multilingual support later.” It’s also a quiet play for every call centre, clinic, and government service that currently employs human interpreters for routine interactions. OpenAI isn’t just selling a voice API. It’s repricing a labour category.
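To budget against the prices quoted above, the arithmetic is simple enough to script. This sketch uses only the list prices from this issue; how many audio tokens a minute of conversation consumes is something you would have to measure on your own traffic, so the token counts are inputs rather than assumptions baked into the code. The function names are ours.

```python
# List prices as quoted in this issue (USD).
REALTIME_2_INPUT_USD_PER_M = 32.0    # per 1M audio input tokens
REALTIME_2_OUTPUT_USD_PER_M = 64.0   # per 1M audio output tokens
TRANSLATE_USD_PER_MIN = 0.034
WHISPER_USD_PER_MIN = 0.017


def realtime_2_cost(input_tokens: int, output_tokens: int) -> float:
    """Token-metered cost for GPT-Realtime-2 audio."""
    return (
        input_tokens / 1_000_000 * REALTIME_2_INPUT_USD_PER_M
        + output_tokens / 1_000_000 * REALTIME_2_OUTPUT_USD_PER_M
    )


def per_minute_cost(minutes: float, usd_per_min: float) -> float:
    """Minute-metered cost for the translation and transcription models."""
    return minutes * usd_per_min


# Example: a month with 10,000 minutes of live translation.
monthly_translate_usd = per_minute_cost(10_000, TRANSLATE_USD_PER_MIN)
```

The split matters for forecasting: the translation and transcription models scale with call minutes, while GPT‑Realtime‑2 scales with tokens, so long silent holds and chatty agents hit the two meters very differently.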

2. Finance agents just became the new enterprise AI battleground

On 5 May, Anthropic released 10 ready-to-run financial services agent templates covering pitchbook creation, KYC screening, earnings review, model building, market research, valuation, general-ledger reconciliation, month-end close, and statement auditing. They ship as Claude Cowork and Claude Code plugins, with cookbooks for Claude Managed Agents.

What matters isn’t just the agent count. It’s the full production stack around them: governed data connectors, credential vaults, permissions, audit logs, and human review checkpoints. These are the boring pieces teams usually spend months building themselves. Claude now works across Excel, PowerPoint, and Word (Outlook coming soon), with new connectors for Dun & Bradstreet, IBISWorld, Verisk, and a Moody’s MCP app covering 600M+ companies.

OpenAI moved on the same front, announcing a PwC collaboration on finance agents for planning, forecasting, reporting, and accounting close. Anthropic also announced a new enterprise AI services company with Blackstone, Hellman & Friedman, and Goldman Sachs to help mid-sized businesses deploy Claude in real operations.

🫵 Why it matters to you: If your team is maintaining an internal RAG pipeline for KYC, invoice reconciliation, analyst research, or compliance work, the question is no longer just “can we build this?” It’s “are we out-engineering a vendor that already ships with the connectors, audit trail, model, and implementation team?”

🤫 The subtext nobody says out loud: The moat isn’t the model anymore. It’s distribution, connectors, workflow templates, and forward-deployed engineers. The AI industry’s answer to “agents don’t work in regulated industries” turns out to be more integrations and more humans around the model. Slightly less sci-fi. Much more useful.


3. Mini Shai-Hulud hit the ML supply chain

On 30 April, two malicious versions of the lightning PyPI package, 2.6.2 and 2.6.3, were published with credential-stealing code. The critical detail: the payload runs on import, not just on install. The compromised versions shipped a hidden _runtime directory that downloaded the Bun JavaScript runtime and executed a roughly 11 MB obfuscated credential stealer. The last clean release is 2.6.1.

An AI security scanner flagged both versions 18 minutes after publication. The same campaign hit intercom-client@7.0.4 on npm that same day, using credentials from a compromised developer account. The payload targeted cloud keys, GitHub tokens, SSH keys, and environment variables via a preinstall hook.

🫵 Why it matters to you: ML dependencies are now premium targets. They sit next to cloud credentials, model weights, notebooks, experiment trackers, and CI/CD pipelines. A compromised training dependency can have a larger blast radius than a compromised web library because those environments are usually more privileged and far less locked down.

🤫 The subtext nobody says out loud: “Just pip install the thing” is now a security decision. The AI stack has inherited npm’s supply-chain problems, except now the packages live next to AWS keys, Hugging Face tokens, and private model artifacts.

🛠️ Practical takeaways:

Block and audit lightning==2.6.2 and lightning==2.6.3. Pin to 2.6.1 until you’ve verified a clean later release.
Treat any environment that imported those versions as compromised. Rotate cloud keys, GitHub PATs, npm tokens, and SSH keys.
Audit lockfiles and CI logs. Look for lightning 2.6.2/2.6.3 and intercom-client 7.0.4 from 30 April onward.
Review .github/workflows/ files added or changed after 30 April.
Move CI to short-lived OIDC tokens. Long-lived credentials are exactly what import-time payloads are hunting.
Harden high-risk accounts. Passkeys, hardware keys, shorter sessions, and tighter recovery are basic hygiene for anyone with access to production AI systems.
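For the lockfile and CI-log audit, a quick text scan is a reasonable first pass. The sketch below is an assumption-laden helper, not a full audit: it only recognizes pip-style `name==version` pins and literal `name@version` strings (as they appear in requirements files and install logs), and it does not parse the nested structure of `package-lock.json` or `yarn.lock`. The function name is ours; the versions are the ones named above.

```python
import re

# Versions named in this issue as compromised.
COMPROMISED = {
    "lightning": {"2.6.2", "2.6.3"},
    "intercom-client": {"7.0.4"},
}

# Matches "lightning==2.6.2" and "intercom-client@7.0.4".
# The lookbehind stops "pytorch-lightning" from matching as "lightning";
# the lookahead stops "2.6.20" or "2.6.2.post1" from matching as "2.6.2".
PIN = re.compile(
    r"(?<![\w.-])([A-Za-z0-9][A-Za-z0-9._-]*)\s*(?:==|@)\s*(\d+(?:\.\d+)*)(?![\w.])"
)


def find_compromised(text: str) -> list[tuple[int, str, str]]:
    """Return (line number, package, version) for every compromised pin in the text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, version in PIN.findall(line):
            if version in COMPROMISED.get(name.lower(), set()):
                hits.append((lineno, name.lower(), version))
    return hits
```

Point it at the contents of `requirements*.txt`, exported environment files, and CI logs from 30 April onward. A hit means the environment should be treated as compromised, per the takeaways above; no hits does not prove you are clean, since unpinned installs will not show up here.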

4. GPT-5.5 Instant became ChatGPT’s new default

OpenAI rolled out GPT-5.5 Instant as ChatGPT’s default model on 5 May, replacing GPT-5.3 Instant for all users. The headline claim: 52.5% fewer hallucinated claims on high-stakes prompts in medicine, law, and finance, and 37.3% fewer inaccurate claims on conversations users had already flagged for errors. It also handles images better, answers STEM questions more reliably, and makes smarter decisions about when to use web search.

The same release introduced memory sources: visible context that shows users which saved memories or past conversations are shaping a response, with options to delete or correct them. GPT-5.5 Instant is also available in the API as chat-latest.

🫵 Why it matters to you: The default model matters more than most benchmark launches. It shapes what non-experts use, what your coworkers paste into workflows, and what everyone considers “normal” AI quality. If you maintain internal assistants, support bots, or research prompts, re-test old failure cases: hallucination workarounds and “always search first” hacks may no longer be necessary.

🤫 The subtext nobody says out loud: The frontier race is loud. The default-model race is where habits form. OpenAI doesn’t need users to know which model they’re on. It just needs the default to feel good enough that they stop shopping around.

5. Anthropic bought more Claude capacity from SpaceX

On 6 May, Anthropic announced a compute deal with SpaceX, securing access to all capacity at the Colossus 1 data center, adding over 300 MW and more than 220,000 NVIDIA GPUs, available within the month.

Anthropic is using the headroom immediately: Claude Code’s five-hour rate limits are doubling for Pro, Max, Team, and Enterprise plans. Peak-hour throttling is gone for Pro and Max users. Claude Opus API rate limits are going up too.

🫵 Why it matters to you: If you use Claude Code or Opus for serious engineering or long-running agentic work, capacity is part of the product. Higher limits change whether a tool is “useful occasionally” or “viable as a daily workhorse.”

🤫 The subtext nobody says out loud: AI subscriptions are becoming compute allocation plans dressed as productivity tools. The next pricing war may not be $20 vs $30 per seat. It may be about who gives your agents enough uninterrupted GPU time to actually finish the job.

6. DeepSeek got a reality check: strong, cheap, not frontier

On 1 May, NIST’s Center for AI Standards and Innovation published its evaluation of DeepSeek V4 Pro. The headline finding: DeepSeek V4 is the most capable PRC AI model CAISI has evaluated, but it lags leading U.S. frontier models by roughly 8 months. CAISI also found that DeepSeek’s own reported benchmarks put V4 closer to Opus 4.6 and GPT-5.4, while CAISI’s independent evaluations place it nearer to GPT-5.

The more useful finding is cost. Compared with GPT-5.4 mini, DeepSeek V4 was cheaper on 5 of 7 comparable benchmarks, ranging from 53% less expensive to 41% more expensive depending on the task. Separately, DeepSeek is reportedly in advanced funding talks at around a $50B valuation, signalling that the market still sees real strategic value in capable open-weight models even when they trail the frontier.

🫵 Why it matters to you: Don’t blindly swap frontier APIs for open-weight models. Don’t ignore them either. The boring-but-correct move: build an eval set from your own prompts, measure quality, latency, refusal behaviour, tool use, and cost, then route workloads based on results.

🤫 The subtext nobody says out loud: Open models don’t need to be best-in-class to pressure proprietary moats. They only need to be good enough for a large chunk of production workloads. The expensive frontier models hold the prestige tier. Cheaper open models eat the workhorse layer.
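The “measure, then route” step can be sketched in a few lines. This assumes you already have per-model, per-workload results from your own eval set; the `EvalResult` fields, thresholds, and the rule itself (cheapest model that clears the bar) are illustrative choices, and refusal behaviour and tool use are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Outcome of running one model on one workload from your own eval set (hypothetical schema)."""
    model: str
    workload: str
    quality: float                 # 0..1 score from your own grading
    p95_latency_s: float
    cost_per_1k_tasks_usd: float


def route(results: list[EvalResult], min_quality: float, max_latency_s: float) -> dict[str, str]:
    """For each workload, pick the cheapest model that clears the quality and latency bars."""
    best: dict[str, EvalResult] = {}
    for r in results:
        if r.quality < min_quality or r.p95_latency_s > max_latency_s:
            continue  # fails a threshold, not eligible for this workload
        current = best.get(r.workload)
        if current is None or r.cost_per_1k_tasks_usd < current.cost_per_1k_tasks_usd:
            best[r.workload] = r
    # Workloads where no model passed are absent; the caller decides the fallback.
    return {workload: r.model for workload, r in best.items()}
```

In practice you would set stricter thresholds for high-stakes workloads, so the cheaper model wins only where it has actually earned it on your data.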

💡 Our take

Two of this week’s stories look unrelated but share the same logic.

Anthropic and OpenAI are selling finance agents complete with templates, connectors, governance, and implementation help. The Lightning compromise is attacking the same ecosystem from the other side: as AI infrastructure becomes more concentrated and more privileged, a single bad dependency can reach a lot of valuable systems very quickly.

The pattern is simple: higher leverage means faster wins and a faster blast radius.

The key signals from this week:

Domain-specific agent templates are becoming the default shape of enterprise AI. Generic agent platforms are giving way to packaged workflows with connectors, audit logs, permissions, and human review steps built in.
Supply-chain security is now core MLOps. ML packages are privileged infrastructure, not harmless notebook helpers.
Default models matter more than launch hype. GPT-5.5 Instant shifting the ChatGPT baseline affects more daily users than any frontier benchmark post.
Compute is still the product bottleneck. Anthropic’s SpaceX deal is a feature release powered by 220,000 GPUs.
Open-weight models are a persistent cost-pressure machine. DeepSeek may not be frontier, but “cheap and good enough” is a very dangerous position to compete against.

The big question to ask isn’t “which model is best?” It’s “which system gives us the best mix of quality, cost, control, security, and time to production?”

That answer is getting more situational every week.

📌 Your to-do list

Run a voice-agent build-vs-buy review. Add voice channels to your top internal workflows (support, sales, operations, compliance). Check whether OpenAI’s new Realtime models, plus templates from partners, now shortcut custom development.
Audit your lockfiles now. Search for lightning==2.6.2, lightning==2.6.3, and intercom-client@7.0.4. Treat affected environments as compromised, not merely outdated.
Move CI secrets to short-lived OIDC tokens. Long-lived cloud keys are exactly what import-time and install-time payloads are hunting.
Harden high-risk AI accounts. Use passkeys or hardware keys where possible. Tighten recovery. Review active sessions, especially for accounts with Codex, cloud, repo, or production access.
Run a build-vs-buy review on your top three internal agent workflows. Focus on finance, compliance, reconciliation, research, procurement, and reporting. If a vendor now ships with the connectors you spent months building, you have a decision to make.
Re-test your ChatGPT workflows on GPT-5.5 Instant. Old verbosity constraints, hallucination workarounds, and “always search first” patterns may no longer be needed.
Benchmark DeepSeek V4 on real workloads, not vibes. Use your own eval set. Route easy or cost-sensitive tasks to cheaper models where they pass quality thresholds. Keep frontier models for high-stakes or high-ambiguity work.

See you next week.