Tuesday, 31 March 2026

New top story on Hacker News: Show HN: PhAIL – Real-robot benchmark for AI models
7 by vertix | 7 comments on Hacker News.
I built this because I couldn't find honest numbers on how well VLA models [1] actually work on commercial tasks. I come from search ranking at Google, where you measure everything, and in robotics nobody seemed to know.

PhAIL runs four models (OpenPI/pi0.5, GR00T, ACT, SmolVLA) on bin-to-bin order picking, one of the most common warehouse operations. Same robot (Franka FR3), same objects, hundreds of blind runs. The operator doesn't know which model is running.

Best model: 64 UPH. Human teleoperating the same robot: 330. Human by hand: 1,300+.

Everything is public: every run with synced video and telemetry, the fine-tuning dataset, training scripts. The leaderboard is open for submissions.

Happy to answer questions about methodology, the models, or what we observed.

[1] Vision-Language-Action: https://ift.tt/V0ZmpGA
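To put the three throughput figures above on a common footing, here is a minimal sketch that converts them to seconds per unit and to fractions of the human baselines. It assumes UPH means units per hour and treats "1,300+" as exactly 1,300, so the by-hand comparison is a bound rather than an exact value. The function and variable names are illustrative and not part of PhAIL.

```python
# Throughput figures quoted in the post, assumed to be units per hour (UPH).
UPH = {
    "best model": 64,
    "human teleoperating": 330,
    "human by hand": 1300,  # the post says "1,300+", so this is a lower bound
}


def seconds_per_unit(units_per_hour):
    """Average time spent on one unit at the given hourly throughput."""
    return 3600 / units_per_hour


def relative_throughput(uph, baseline):
    """Express every entry as a fraction of the baseline entry's throughput."""
    base = uph[baseline]
    return {name: value / base for name, value in uph.items()}


vs_teleop = relative_throughput(UPH, "human teleoperating")
vs_hand = relative_throughput(UPH, "human by hand")

for name, value in UPH.items():
    print(f"{name}: {seconds_per_unit(value):.1f} s per unit")
print(f"best model vs teleoperation: {vs_teleop['best model']:.1%}")
print(f"best model vs by hand (upper bound): {vs_hand['best model']:.1%}")
```

On these numbers the best model reaches roughly a fifth of teleoperated throughput and at most about a twentieth of by-hand throughput.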

Monday, 30 March 2026

Sunday, 29 March 2026

Saturday, 28 March 2026

Friday, 27 March 2026

Thursday, 26 March 2026

New top story on Hacker News: Show HN: Orloj – agent infrastructure as code (YAML and GitOps)
7 by An0n_Jon | 3 comments on Hacker News.
Hey HN, we're Jon and Kristiane, and we're building Orloj (https://orloj.dev), an open-source (Apache 2.0) orchestration runtime for multi-agent AI systems. You define agents, tools, policies, and workflows in declarative YAML manifests, and Orloj handles scheduling, execution, governance, and reliability.

We built this because running AI agents in production today looks a lot like running containers before Kubernetes: ad-hoc scripts, no governance, no observability, no standard way to manage the lifecycle of an agent fleet. Everyone we talked to was writing the same messy glue code to wire agents together, and nobody had a good answer for "which agent called which tool, and was it supposed to?"

Orloj treats agents the way infrastructure-as-code treats cloud resources. You write a manifest that declares an agent's model, tools, permissions, and execution limits. You compose agents into directed graphs: pipelines, hierarchies, or swarm loops.

The part we're most excited about is governance. AgentPolicy, AgentRole, and ToolPermission are evaluated inline during execution, before every agent turn and tool call. Instead of prompt instructions that the model might ignore, these policies are a runtime gate. Unauthorized actions fail closed with structured errors and full audit trails. You can set token budgets per run, whitelist models, block specific tools, and scope policies to individual agent systems.

For reliability, we built lease-based task ownership (so crashed workers don't leave orphan tasks), capped exponential retry with jitter, idempotent replay, and dead-letter handling. The scheduler supports cron triggers and webhook-driven task creation.

The architecture is a server/worker split. orlojd hosts the API, resource store (in-memory for dev, Postgres for production), and task scheduler. orlojworker instances claim and execute tasks, route model requests through a gateway (OpenAI, Anthropic, Ollama, etc.), and run tools in configurable isolation: direct, sandboxed, container, or WASM. For local development, you can run everything in a single process with orlojd --embedded-worker --storage-backend=memory.

Tool isolation was important to us. A web search tool probably doesn't need sandboxing, but a code execution tool should run in a container with no network, a read-only filesystem, and a memory cap. You configure this per tool based on risk level, and the runtime enforces it.

We also added native MCP support. You register an MCP server (stdio or HTTP), Orloj auto-discovers its tools, and they become first-class resources with governance applied. So you can connect something like the GitHub MCP server and still have policy enforcement over what agents are allowed to do with it.

Three starter blueprints are included (pipeline, hierarchical, swarm-loop). Docs: https://docs.orloj.dev

We're also building out starter templates for operational workflows where governance really matters. First on the roadmap:

1. Incident response triage
2. Compliance evidence collector
3. CVE investigation pipeline
4. Secret rotation auditor

We have 20 templates in mind and community contributions are welcome.

We're a small team and this is v0.1.0, so there's a lot still on the roadmap: hosted cloud, compliance packaging, and more. But the full runtime is open source today and we'd love feedback on what we've built so far. What would you use this for? What's missing?
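The post mentions capped exponential retry with jitter followed by dead-letter handling. As a generic illustration of that technique, and not Orloj's actual code or API, here is a minimal Python sketch. The names backoff_delay, run_with_retry and DeadLetterError, along with the default base and cap values, are all assumptions made for the example.

```python
import random
import time


class DeadLetterError(Exception):
    """Raised when a task has used up all of its attempts (hypothetical name)."""


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Return a jittered delay in seconds for a 0-based retry attempt.

    The ceiling doubles with each attempt but never exceeds `cap`; the
    actual delay is drawn uniformly from [0, ceiling] ("full jitter").
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)


def run_with_retry(task, max_attempts=5, sleep=time.sleep):
    """Call `task` until it succeeds or `max_attempts` is reached."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception as exc:  # a real worker would catch narrower errors
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
    # Every attempt failed: hand the task over to dead-letter handling.
    raise DeadLetterError(f"gave up after {max_attempts} attempts") from last_error
```

The cap keeps a long failure streak from producing unbounded waits, and the random jitter spreads out retries from many workers that failed at the same moment. Passing `sleep` as a parameter makes the loop easy to test without real delays.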

Wednesday, 25 March 2026

Tuesday, 24 March 2026

New top story on Hacker News: Tell HN: Litellm 1.82.7 and 1.82.8 on PyPI are compromised
109 by dot_treo | 292 comments on Hacker News.
About an hour ago, new versions were deployed to PyPI. I was just setting up a new project, and things behaved weirdly. My laptop ran out of RAM; it looked like a fork bomb was running. I investigated and found that a base64-encoded blob has been added to proxy_server.py. It writes and decodes another file, which it then runs. I'm in the process of reporting this upstream, but wanted to give everyone here a heads-up. It is also reported in this issue: https://ift.tt/jyvVB5a
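For anyone wanting a quick check of their own environment, here is a minimal sketch that compares the installed litellm version against the two versions named in the title. It reads package metadata only through the standard library's importlib.metadata and does not import litellm itself. The function names are illustrative, and a clean result says nothing about whether a payload already ran on the machine.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported as compromised in the post title.
COMPROMISED_LITELLM = {"1.82.7", "1.82.8"}


def is_compromised(installed_version):
    """True if the given version string is one of the reported bad releases."""
    return installed_version in COMPROMISED_LITELLM


def check_installed(package="litellm"):
    """Return a one-line status for the package in the current environment."""
    try:
        installed = version(package)  # reads metadata, does not import the package
    except PackageNotFoundError:
        return f"{package} is not installed in this environment"
    if is_compromised(installed):
        return f"{package} {installed} is one of the reported compromised versions"
    return f"{package} {installed} is not one of the reported compromised versions"


if __name__ == "__main__":
    print(check_installed())
```

Run it with the same interpreter or virtual environment that the project uses, since each environment has its own installed packages.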

Sunday, 22 March 2026

Saturday, 21 March 2026

New top story on Hacker News: Show HN: Joonote – A note-taking app on your lock screen and notification panel
12 by kilgarenone | 3 comments on Hacker News.
I finally built this app after many years of being sick of unlocking my phone every goddamn time I need to take or view my notes. It particularly sucks when I'm doing my grocery shopping and going down the list.

I started building it in June last year. This is a native app written in Kotlin. And since I'm a 100% Web dev guy, I gotta say this wouldn't have been possible without AI to assist me. That said, this isn't "vibe-coded". I simply used the chat interface on the Gemini website and manually copy-pasted code to build and integrate every single thing in the app! I used Gemini just because I was piggybacking on my last company's enterprise subscription. I personally didn't subscribe to any AI (and still don't, cuz the free quota seems enough for me :)

So I certainly have learned a lot about Android development, architecture patterns, Kotlin syntax, and obeying Google's whims. Can't say I love it all, but for the sake of this app, I will :)

Anyway, I finally have the app I wish existed, and I'm using it every day. It not only does the main thing I needed it to do, but there's also all this stuff:

- Make your notes private if you don't want to show them on lock screen.
- Create check/to-do lists.
- Set one-time or recurring reminders.
- Full-text search your notes in the app.
- Speech-to-text.
- Organize your notes with custom or color labels.
- Pin the app as a widget on your home screen.
- You can auto backup and restore your notes on a new install or Android device.
- Works offline.
- And no funny business happening in the background.

https://ift.tt/Ntw6cC5

It's a 30-day trial, then a one-time $9.99 to go Pro forever. I would love you all to check it out, FWIW. Ok thanks!

Friday, 20 March 2026

Thursday, 19 March 2026

New top story on Hacker News: Show HN: Dumped Wix for an AI Edge agent so I never have to hire junior staff
8 by axotopia | 10 comments on Hacker News.
I run a building design consultancy. I got tired of paying Wix $40/month for a brochure that couldn’t answer simple service questions, and of wasting hours on the same FAQs. So I killed it all and spent 4 months building a 'talker': https://axoworks.com

The stack is completely duct-taped: Netlify’s 10s serverless timeout forced me to split the agent into three pieces: Brain (Edge), Hands (Browser), and Voice (Edge). I haven’t coded in 30 years. This was 3 steps forward, 2 steps back, heavily guided by AI.

The fight that proved it worked: 2 weeks ago, a licensed architect attacked the bot, trying to prove my business model harms the profession. The AI (DeepSeek-R3) completely dismantled his arguments. It was hilariously caustic. Log: https://ift.tt/z4R1uw0...

A few battle scars:

* Web Speech API works fine, right up until someone speaks Chinese without toggling the language mode. Then it forcefully spits out English phonetic gibberish. Still a headache.
* Liability is the killer. Hallucinate a building code clause? We’re dead. Insurance won’t touch us.
* We publish the audit logs to keep ourselves honest and make sure the system stays hardened. Audit: https://ift.tt/0bQzJ73

The hardest part was getting the intent right: making one LLM pivot seamlessly from a warm principal’s tone with a homeowner to a defensive bulldog when attacked by a peer. That took 2.5 months of tuning.

We burn through tokens with an 'Eager RAG' hack (pre-fetching guesses) just to improve responsiveness. I also ripped out the “essential” persistent DBs: less than 5% of visitors ever return, so why bother? If a client drops mid-query, their session vanishes. No server-side queues.

The point: To let me operate with a network of seasoned pros, and trim the fat.

Try to break it. I’ll be in the comments. Kee

Wednesday, 18 March 2026

Tuesday, 17 March 2026

Sunday, 15 March 2026

Saturday, 14 March 2026

Friday, 13 March 2026

Thursday, 12 March 2026

Wednesday, 11 March 2026

Tuesday, 10 March 2026

Monday, 9 March 2026

Sunday, 8 March 2026

Saturday, 7 March 2026

Friday, 6 March 2026

Thursday, 5 March 2026

Tuesday, 3 March 2026

Sunday, 1 March 2026