<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Machine Learning Pills: Real-World]]></title><description><![CDATA[Discover real world applications or examples of the previously introduced theoretical content.]]></description><link>https://mlpills.substack.com/s/real-world-example</link><image><url>https://substackcdn.com/image/fetch/$s_!yCAU!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8efe1d-e165-4098-9fcc-b465f7286f50_1063x1063.png</url><title>Machine Learning Pills: Real-World</title><link>https://mlpills.substack.com/s/real-world-example</link></image><generator>Substack</generator><lastBuildDate>Sun, 19 Apr 2026 20:12:18 GMT</lastBuildDate><atom:link href="https://mlpills.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[MLPills]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[mlpills@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[mlpills@substack.com]]></itunes:email><itunes:name><![CDATA[David Andrés]]></itunes:name></itunes:owner><itunes:author><![CDATA[David Andrés]]></itunes:author><googleplay:owner><![CDATA[mlpills@substack.com]]></googleplay:owner><googleplay:email><![CDATA[mlpills@substack.com]]></googleplay:email><googleplay:author><![CDATA[David Andrés]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[RW #9 - How Vector DBs Store 100M Embeddings on One Machine (and Still Search Fast)]]></title><description><![CDATA[The claim &#8220;Our vector database can handle 100 million embeddings on a single machine.&#8221; Sounds impressive, but the math doesn&#8217;t work using 
just naive storage. Let&#8217;s take a standard embedding: 768 dims, float32. That&#8217;s 3,072 bytes per vector. Multiply by 100 million: 307 GB. Just for the vectors. No index. No metadata. No IDs. No breathing room for the OS. Just raw floats sitting in memory.

Most machines have 64-128 GB of RAM. We&#8217;re roughly 2.5-5&#215; over budget before we&#8217;ve even started. So how do production systems actually pull this off? The answer is a systems design pattern built on one key insight: you don&#8217;t need full-precision vectors in RAM. You need just enough precision to find candidates, then you refine.]]></description><link>https://mlpills.substack.com/p/rw-9-how-vector-dbs-store-100m-embeddings</link><guid isPermaLink="false">https://mlpills.substack.com/p/rw-9-how-vector-dbs-store-100m-embeddings</guid><dc:creator><![CDATA[Nino Risteski]]></dc:creator><pubDate>Sun, 11 Jan 2026 11:09:33 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/278be9f5-32fc-4ce8-ba81-083dfc9d9063_2784x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>I&#8217;m <a href="https://ninoworklog.substack.com/">Nino</a>, and I&#8217;m happy to share that I&#8217;ve joined MLPills to write about AI! I am a Founding ML Engineer at AIRecruitPro, an open-source contributor, tech writer and a self-taught builder who learns by creating things from the ground up. My work is at the intersection of AI systems, product engineering, and startups, where I focus on designing multimodal models, GPU-accelerated training, and ML infrastructure. I love building companies, solving hard problems, and turning ideas into real products that people use. Say hi on <a href="https://x.com/ninoristeski">X</a> or visit my <a href="https://www.linkedin.com/in/nino-risteski/">LinkedIn</a>!</em></p><h1>&#128138; Pill of the week</h1><p>The claim <em>&#8220;Our vector database can handle 100 million embeddings on a single machine&#8221;</em> sounds impressive, but the math doesn&#8217;t work with naive storage alone. Let&#8217;s take a standard embedding: 768 dims, float32. That&#8217;s 3,072 bytes per vector. Multiply by 100 million: <strong>307 GB. Just for the vectors. </strong>No index. No metadata. No IDs. No breathing room for the OS. 
Just raw floats sitting in memory.</p><p>Most machines have 64-128 GB of RAM. We&#8217;re roughly 2.5-5&#215; over budget before we&#8217;ve even started. So how do production systems actually pull this off? The answer is a systems design pattern built on one key insight: <strong>you don&#8217;t need full-precision vectors in RAM</strong>. You need just enough precision to find candidates, then you refine.</p><p>The recipe has four parts:</p><ul><li><p><strong>Compress</strong> the vectors (307 GB &#8594; ~10 GB using Product Quantization)</p></li><li><p><strong>Partition</strong> the search space (don&#8217;t scan everything)</p></li><li><p><strong>Score cheaply</strong> on compressed codes</p></li><li><p><strong>Refine</strong> only the top candidates with full precision</p></li></ul><p>This post walks through the exact math, the compression techniques that make it possible, and a runnable demo you can use to verify the claims yourself. By the end, you&#8217;ll understand precisely where every byte goes and how to tune the tradeoffs for your own system.</p><h3>The Math</h3><p>Say you&#8217;re building a RAG system. You&#8217;ve chunked your document corpus, embedded each chunk with a model like OpenAI&#8217;s <code>text-embedding-3-small</code>, and now you need to store and search those vectors. Here are realistic numbers: 100 million vectors, 768 dims each (<code>text-embedding-3-small</code> returns 1,536 dimensions by default, so this assumes the output is shortened to 768), stored as float32 (4 bytes per value).</p><p>The storage calculation:</p><pre><code><code>Memory = N &#215; d &#215; bytes_per_float
       = 100,000,000 &#215; 768 &#215; 4
       = 307,200,000,000 bytes
       = 307.2 GB</code></code></pre><p>That&#8217;s <strong>307 GB for the vectors alone</strong>.</p><p>But a working system also needs an index structure (HNSW graph edges or IVF posting lists), which adds anywhere from about 10 to 250+ bytes per vector depending on the method. You need vector IDs to map results back to your documents, typically 8 bytes each. Metadata like timestamps, permissions, and filter fields pile on more. Then there&#8217;s allocator overhead, fragmentation, alignment, and padding, which eat another 5-15%. And if you care about reliability, you&#8217;re replicating the data, which at least doubles everything.</p><p>Now consider what you&#8217;re working with. A typical general-purpose cloud VM comes with 64 GB of RAM. Memory-optimized instances give you 128-256 GB. High-memory machines can reach 512 GB or more, but you&#8217;ll pay dearly for them. The gap is brutal: a typical production machine has 64-128 GB of RAM, but our vectors alone need 307 GB.</p><p>Something has to give. We can throw money at bigger machines (which doesn&#8217;t scale), distribute across many nodes (which adds latency and complexity), or find a way to <strong>radically compress</strong> what we keep in RAM.</p><h3>Where Memory Actually Goes</h3><p>Before we can fix the problem, we need to understand where the bytes actually go. Not all memory is created equal: some components are compressible, others aren&#8217;t. Some scale linearly with N, others don&#8217;t. </p><p>Let&#8217;s break down the four main buckets:</p><ol><li><p>Vectors</p></li><li><p>Indexes</p></li><li><p>IDs and Metadata</p></li><li><p>Overhead</p></li></ol><h4>Vectors</h4><p>In an uncompressed system, this is where most of your RAM goes. The good news is that this is also the <a href="https://arxiv.org/html/2401.08281v2">most compressible part</a> of the system. The entire point of techniques like Product Quantization is to shrink this from hundreds of gigabytes to roughly ten. 
Everything else in this breakdown is noise compared to solving the vector storage problem.</p><h4>Indexes</h4><p>Scanning a flat array of all 100 million vectors on every query is too slow, so you need an index to narrow the search space. But indexes have their own memory footprint, and it varies dramatically by method.</p><ul><li><p><strong><a href="https://mlpills.substack.com/p/issue-117-scaling-vector-search-hnsw">HNSW</a></strong> (Hierarchical Navigable Small World) builds a graph where each vector connects to its <a href="https://arxiv.org/abs/1603.09320">approximate neighbors</a>. With a typical configuration like 32 edges per vector across multiple layers, you&#8217;re adding 128-256 bytes per vector just for graph connectivity: assuming 4-byte neighbor IDs, 32 links cost 128 bytes, and implementations that keep up to twice as many links on the bottom layer approach 256. At 100M vectors, that&#8217;s another 13-26 GB for the graph alone. HNSW gives excellent latency and recall, but at a real memory cost.</p></li><li><p><strong>IVF</strong> (Inverted File Index) clusters vectors into partitions and stores posting lists of which vectors belong to each cluster. The overhead per vector is <a href="https://github.com/facebookresearch/faiss/wiki/The-index-factory">much smaller</a>, but you pay for the cluster centroids and list management. 
For 100M vectors, IVF structures typically add 1-3GB depending on configuration.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eQ8D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ace9b8-3788-4e23-b2f3-3424e1f7a46e_2784x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eQ8D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ace9b8-3788-4e23-b2f3-3424e1f7a46e_2784x1536.png 424w, https://substackcdn.com/image/fetch/$s_!eQ8D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ace9b8-3788-4e23-b2f3-3424e1f7a46e_2784x1536.png 848w, https://substackcdn.com/image/fetch/$s_!eQ8D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ace9b8-3788-4e23-b2f3-3424e1f7a46e_2784x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!eQ8D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ace9b8-3788-4e23-b2f3-3424e1f7a46e_2784x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eQ8D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ace9b8-3788-4e23-b2f3-3424e1f7a46e_2784x1536.png" width="1456" height="803" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23ace9b8-3788-4e23-b2f3-3424e1f7a46e_2784x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:803,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8364066,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/183916031?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ace9b8-3788-4e23-b2f3-3424e1f7a46e_2784x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eQ8D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ace9b8-3788-4e23-b2f3-3424e1f7a46e_2784x1536.png 424w, https://substackcdn.com/image/fetch/$s_!eQ8D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ace9b8-3788-4e23-b2f3-3424e1f7a46e_2784x1536.png 848w, https://substackcdn.com/image/fetch/$s_!eQ8D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ace9b8-3788-4e23-b2f3-3424e1f7a46e_2784x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!eQ8D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23ace9b8-3788-4e23-b2f3-3424e1f7a46e_2784x1536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The choice between HNSW and IVF often comes down to this tradeoff: HNSW is faster and more accurate but memory-heavy; IVF is leaner but requires more tuning to match HNSW&#8217;s recall.</p><p>You can check our previous issue on <a href="https://mlpills.substack.com/p/issue-117-scaling-vector-search-hnsw">HNSW</a>:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;31ae8996-9638-4a01-964a-d088b9c1e1ab&quot;,&quot;caption&quot;:&quot;&#128138; Pill of the week&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Issue #117 - Scaling Vector Search: HNSW and Approximate Search&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:38707812,&quot;name&quot;:&quot;David 
Andr&#233;s&quot;,&quot;bio&quot;:&quot;&#128188; Data Scientist &#8226; &#128013; Python enthusiast&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db6423b2-36bc-440c-be7d-b54be5bad1b0_1447x1448.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2026-01-03T08:01:19.204Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d9fedcc0-2a3b-48a3-8dfc-f37856b71cb1_2752x1536.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://mlpills.substack.com/p/issue-117-scaling-vector-search-hnsw&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:183173876,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:8,&quot;comment_count&quot;:0,&quot;publication_id&quot;:1354140,&quot;publication_name&quot;:&quot;Machine Learning Pills&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!yCAU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8efe1d-e165-4098-9fcc-b465f7286f50_1063x1063.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h4>IDs and metadata</h4><p>Every vector needs an identity. At minimum, you need a way to map search results back to your source documents. A 64-bit ID costs 8 bytes per vector, that&#8217;s 800MB at 100M scale. Sounds small compared to 307GB of vectors, but once you compress the vectors down to ~10GB, suddenly 800MB of IDs represents 8% of your memory budget!</p><p>Metadata compounds this. Timestamps for freshness filtering. Permission flags for access control. Chunk offsets for retrieval. Category tags for faceted search. Each field you add multiplies across 100M rows. A system with 32 bytes of metadata per vector adds another 3.2 GB. 
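</p><p>The buckets covered so far can be tallied in a few lines of Python. This is only a sketch under stated assumptions: the helper <code>memory_budget_gb</code> and its parameter names are made up for illustration, the per-vector byte counts are the rough figures quoted above (128 bytes of graph links, 8-byte IDs, 32 bytes of metadata), the 96-byte code size and 20-byte IVF cost in the second call are assumed values, and the flat 10% overhead is a placeholder inside the 5-15% range mentioned earlier.</p><pre><code>def memory_budget_gb(n_vectors, vector_bytes, index_bytes, id_bytes, metadata_bytes, overhead=0.10):
    """Estimate memory per bucket in decimal GB (1 GB = 1e9 bytes)."""
    per_vector = {
        "vectors": vector_bytes,
        "index": index_bytes,
        "ids": id_bytes,
        "metadata": metadata_bytes,
    }
    budget = {name: n_vectors * size / 1e9 for name, size in per_vector.items()}
    # Assumption: allocator slack is a flat fraction of everything else.
    budget["overhead"] = overhead * sum(budget.values())
    budget["total"] = sum(budget.values())
    return budget


N = 100_000_000
# float32 vectors (768 dims x 4 bytes) with an HNSW-style graph.
uncompressed = memory_budget_gb(N, 768 * 4, 128, 8, 32)
# Assumed 96-byte compressed codes with a lighter IVF-style index.
compressed = memory_budget_gb(N, 96, 20, 8, 32)

for label, budget in (("uncompressed", uncompressed), ("compressed", compressed)):
    print(label)
    for name, gigabytes in budget.items():
        print(f"  {name:10s} {gigabytes:8.1f} GB")
</code></pre><p>With these assumed inputs, vectors dominate the uncompressed budget, while in the compressed one the IDs and metadata are no longer negligible next to the codes themselves.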
</p><h4>Overhead</h4><p>Memory <a href="https://jemalloc.net/jemalloc.3.html">allocators</a> don&#8217;t pack data perfectly. You lose bytes to alignment requirements (8 or 16 byte boundaries), internal fragmentation (allocated blocks are often larger than requested), and bookkeeping (the allocator itself needs to track what&#8217;s allocated where). In practice, expect 5-15% overhead on top of your calculations.</p><p><strong>The key insight</strong></p><blockquote><p>In an uncompressed system, vectors account for 80-90% of total memory. Index structures, IDs, metadata, and overhead split the remaining 10-20%. This means that you should <strong>compress the vectors first</strong>. If you can take vectors from 307GB to 10GB, you&#8217;ve solved 90% of the problem. The rest is optimization at the margins!</p></blockquote><div><hr></div><h1>&#8205;&#127891;Full Stack AI / LLM Engineering*</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://academy.towardsai.net/bundles/holiday-bundle?ref=3b122f" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EfhH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f2c8ec5-b178-4a16-b42b-0f23ef46500e_2752x1275.png 424w, https://substackcdn.com/image/fetch/$s_!EfhH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f2c8ec5-b178-4a16-b42b-0f23ef46500e_2752x1275.png 848w, https://substackcdn.com/image/fetch/$s_!EfhH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f2c8ec5-b178-4a16-b42b-0f23ef46500e_2752x1275.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EfhH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f2c8ec5-b178-4a16-b42b-0f23ef46500e_2752x1275.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EfhH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f2c8ec5-b178-4a16-b42b-0f23ef46500e_2752x1275.png" width="1456" height="675" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f2c8ec5-b178-4a16-b42b-0f23ef46500e_2752x1275.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:675,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6996995,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://academy.towardsai.net/bundles/holiday-bundle?ref=3b122f&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/183173876?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5721b6-dc31-47b0-bd34-c59efdab4c3a_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!EfhH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f2c8ec5-b178-4a16-b42b-0f23ef46500e_2752x1275.png 424w, https://substackcdn.com/image/fetch/$s_!EfhH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f2c8ec5-b178-4a16-b42b-0f23ef46500e_2752x1275.png 848w, 
https://substackcdn.com/image/fetch/$s_!EfhH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f2c8ec5-b178-4a16-b42b-0f23ef46500e_2752x1275.png 1272w, https://substackcdn.com/image/fetch/$s_!EfhH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f2c8ec5-b178-4a16-b42b-0f23ef46500e_2752x1275.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://academy.towardsai.net/bundles/holiday-bundle?ref=3b122f&quot;,&quot;text&quot;:&quot;Claim the Holiday Bundle&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://academy.towardsai.net/bundles/holiday-bundle?ref=3b122f"><span>Claim the Holiday Bundle</span></a></p><p><em>*Sponsored: by purchasing any of their courses you would also be supporting MLPills.</em></p><div><hr></div><h3>The Core Idea: Multi-Stage Retrieval + Compression</h3><p>Think about what search actually requires. You have a query vector. You want the 10 or 100 most similar items. You don&#8217;t care about the precise distance to vector #47,382,019 in the middle of the ranking. You only care whether it&#8217;s in your top results or not.</p><p>Production vector search systems exploit this with a multi-stage pipeline. Each stage trades precision for speed, progressively narrowing the candidate set until only the final results need careful scoring.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://mlpills.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Machine Learning Pills is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h4>Stage 1: Candidate generation</h4><p>The first stage doesn&#8217;t try to find the best results. It tries to find a reasonable superset that <em>contains</em> the best results. This is where approximate nearest neighbor (ANN) algorithms like IVF and HNSW earn their keep. IVF partitions the vector space into clusters and only searches the few clusters closest to your query, maybe 1-5% of the data. HNSW navigates a graph structure, hopping from node to node toward the query region without ever examining most of the corpus.</p><blockquote><p>The goal is to go from 100 million candidates to a few thousand, a reduction of roughly four orders of magnitude, while maintaining high probability that the true top results are somewhere in that shortlist. You&#8217;re not ranking here. You&#8217;re filtering.</p></blockquote><h4>Stage 2: Cheap scoring</h4><p>Now you have a few thousand candidates. You need to rank them, but you still don&#8217;t need full precision. This is where compressed distance calculations shine. With Product Quantization, each vector is represented as a short code, maybe 96 bytes instead of 3,072. Computing approximate distances from these codes is fast: once a small per-query lookup table is built, scoring a candidate is just a series of table lookups and additions, with no floating-point multiplications and no loading of full vectors from memory.</p><p>You score all candidates with these compressed distances and keep the top 100 or so. The ranking won&#8217;t be perfect: some vectors will be slightly misordered due to quantization error, but the rough ordering is preserved. 
The true top-10 results are very likely to be somewhere in your top-100 candidates.</p><h4>Stage 3: Refine</h4><p>If you need more precision, you can re-score your top candidates using the original float32 vectors. This means fetching 100 full vectors (307 KB) instead of 100 million (307 GB), a perfectly tractable amount to load from SSD or a secondary store.</p><p>You compute exact distances, re-rank, and either return the top 10 directly or pass the best 20 or so on to a reranker. This stage recovers most of the accuracy lost to compression, but it only runs on a tiny fraction of the data.</p><h4>Stage 4: Rerank</h4><p>Cross-encoder <a href="https://mlpills.substack.com/p/issue-115-reranking-in-your-rag-pipeline?utm_source=publication-search">rerankers</a> take your query and each candidate document as raw text, feeding them through a model that directly predicts relevance. This is typically more accurate than vector similarity alone, because it can catch semantic nuances that embedding distance misses, but it&#8217;s expensive. Running a cross-encoder on 100 million documents is unthinkable. 
Running it on your top 20 candidates takes milliseconds.</p><blockquote><p>This is the pattern: use cheap, approximate methods to shrink the haystack, then apply expensive, precise methods to find the needle.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wVXv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0749b900-676f-414a-836d-c73040dfaca0_2784x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wVXv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0749b900-676f-414a-836d-c73040dfaca0_2784x1536.png 424w, https://substackcdn.com/image/fetch/$s_!wVXv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0749b900-676f-414a-836d-c73040dfaca0_2784x1536.png 848w, https://substackcdn.com/image/fetch/$s_!wVXv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0749b900-676f-414a-836d-c73040dfaca0_2784x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!wVXv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0749b900-676f-414a-836d-c73040dfaca0_2784x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wVXv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0749b900-676f-414a-836d-c73040dfaca0_2784x1536.png" width="1456" height="803" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0749b900-676f-414a-836d-c73040dfaca0_2784x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:803,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:9007439,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/183916031?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0749b900-676f-414a-836d-c73040dfaca0_2784x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wVXv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0749b900-676f-414a-836d-c73040dfaca0_2784x1536.png 424w, https://substackcdn.com/image/fetch/$s_!wVXv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0749b900-676f-414a-836d-c73040dfaca0_2784x1536.png 848w, https://substackcdn.com/image/fetch/$s_!wVXv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0749b900-676f-414a-836d-c73040dfaca0_2784x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!wVXv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0749b900-676f-414a-836d-c73040dfaca0_2784x1536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Check our previous issue about reranking:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;a8ef1a95-5d0c-4817-b900-470f881b419c&quot;,&quot;caption&quot;:&quot;&#128138; Pill of the week&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Issue #115 - Reranking in your RAG pipeline&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:38707812,&quot;name&quot;:&quot;David Andr&#233;s&quot;,&quot;bio&quot;:&quot;&#128188; Data Scientist &#8226; &#128013; Python 
enthusiast&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db6423b2-36bc-440c-be7d-b54be5bad1b0_1447x1448.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-12-14T08:02:12.466Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a561814-77da-47cc-aae4-c5ac7ae51aeb_2752x1536.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://mlpills.substack.com/p/issue-115-reranking-in-your-rag-pipeline&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:181498812,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:7,&quot;comment_count&quot;:0,&quot;publication_id&quot;:1354140,&quot;publication_name&quot;:&quot;Machine Learning Pills&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!yCAU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b8efe1d-e165-4098-9fcc-b465f7286f50_1063x1063.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p><strong>The pipeline in practice</strong></p><p>100M vectors &#8594; <strong>ANN filter</strong> &#8594; 5,000 candidates &#8594; <strong>PQ scoring</strong> &#8594; 100 candidates &#8594; <strong>exact refine</strong> &#8594; 20 candidates &#8594; <strong>cross-encoder</strong> &#8594; 10 results</p><p>This is how every production vector search system works: Pinecone, Weaviate, Qdrant, Milvus, pgvector at scale, you name it. The specific algorithms differ, the boundaries between stages blur, but the fundamental pattern is universal.</p><h3>Compression Options: From Simple to PQ</h3><p>We&#8217;ve established that vectors are the problem: 307 GB of float32 data that needs to fit in 64 GB of RAM. 
Now let&#8217;s look at the solutions, starting with the obvious approaches and building to the technique that actually works at scale.</p><h4>Simple approaches (and why they&#8217;re not enough)</h4><ul><li><p><strong>Float16 &#8212; half precision</strong></p></li></ul><p>The <a href="https://docs.weaviate.io/weaviate/configuration/compression/pq-compression">simplest compression</a> is to cut your floats in half. Float16 uses 2 bytes instead of 4, giving you an immediate 2&#215; reduction.</p><pre><code><code>100M &#215; 768 &#215; 2 = 153.6 GB</code></code></pre><p>Better, but still about 154 GB. You&#8217;ve gone from &#8220;impossible&#8221; to &#8220;still impossible.&#8221; Float16 is worth using as a baseline optimization, since there&#8217;s rarely a good reason to keep float32 if your embeddings don&#8217;t need the precision, but it won&#8217;t solve the fundamental problem.</p><ul><li><p><strong>Scalar quantization (int8)</strong></p></li></ul><p>Map each float to an 8-bit integer, typically by scaling the observed value range onto 256 levels. You lose precision within that range, but you cut storage to 1 byte per dimension.</p><pre><code><code>100M &#215; 768 &#215; 1 = 76.8 GB</code></code></pre><p>Now we&#8217;re at 77 GB, a 4&#215; reduction from float32. This is actually usable on high-memory machines. Some production systems stop here, especially if they can afford 128-256 GB instances. But we&#8217;re targeting 64 GB or less, and we haven&#8217;t accounted for index overhead. Scalar quantization gets us closer, but it won&#8217;t deliver &#8220;100M vectors on a commodity machine.&#8221; We need something more aggressive.</p><h3>Product Quantization &#8212; the real enabler</h3><p>Product Quantization (PQ) is the <a href="https://arxiv.org/pdf/1102.3828">technique</a> that makes large-scale vector search feasible. 
It&#8217;s been the backbone of billion-scale systems since J&#233;gou, Douze, and Schmid introduced it in 2011, and it remains the dominant approach today.</p><p>The core idea is deceptively simple: <strong>don&#8217;t store vectors, store references to a codebook</strong>.</p><h4>Step 1: Split the vector into subvectors</h4><p>Take your 768-dim vector and divide it into <em>m</em> equal chunks. If m=96, each chunk contains 8 dims.</p><pre><code><code>Original: [v&#8321;, v&#8322;, v&#8323;, ..., v&#8327;&#8326;&#8328;]
                &#8595;
Subvectors: [v&#8321;...v&#8328;], [v&#8329;...v&#8321;&#8326;], [v&#8321;&#8327;...v&#8322;&#8324;], ..., [v&#8327;&#8326;&#8321;...v&#8327;&#8326;&#8328;]
            &#9492;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9496;  &#9492;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9496;  &#9492;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9496;       &#9492;&#9472;&#9472;&#9472;&#9472;&#9472;&#9516;&#9472;&#9472;&#9472;&#9472;&#9472;&#9496;
               s&#8321;         s&#8322;          s&#8323;                 s&#8329;&#8326;</code></code></pre><h4>Step 2: Learn a codebook for each subspace</h4><p>For each of the 96 subspaces, run k-means clustering on the training data to find <em>k</em> representative centroids. Typically k=256, which means each centroid can be identified by a single byte (2&#8312; = 256). After training, you have 96 codebooks, each containing 256 centroids of 8 dimensions.</p><h4>Step 3: Encode each vector as codebook indices</h4><p>For each vector, find the nearest centroid in each subspace and store only the index.</p><pre><code><code>Original subvector s&#8321; = [0.23, -0.41, 0.87, ...]  (8 floats = 32 bytes)
Nearest centroid in codebook&#8321; = index 147
Stored: just the byte "147"</code></code></pre><p>Repeat for all 96 subspaces. Your 768-dimensional vector is now 96 bytes.</p><h4>The compression math</h4><ul><li><p><strong>Original float32:</strong> 100M &#215; 768 &#215; 4 bytes = 307.2 GB</p></li><li><p><strong>PQ codes:</strong> 100M &#215; 96 bytes = 9.6 GB</p></li><li><p><strong>Codebooks:</strong> 96 codebooks &#215; 256 centroids &#215; 8 dims &#215; 4 bytes = 786,432 bytes (0.75 MiB)</p></li></ul><p>The codebook overhead is negligible: less than a megabyte regardless of database size. It&#8217;s a fixed cost shared across all vectors. The PQ codes scale linearly, but at 96 bytes per vector instead of 3,072. That&#8217;s a <strong>32&#215; compression ratio</strong>.</p><p><strong>307 GB &#8594; 9.6 GB</strong></p><p>Now we&#8217;re talking! Single-digit gigabytes for the vector payload, with room left for index structures, IDs, and metadata.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Tcon!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83065a4-af0e-490a-b150-aa4a5b8742cf_2784x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Tcon!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83065a4-af0e-490a-b150-aa4a5b8742cf_2784x1536.png 424w, https://substackcdn.com/image/fetch/$s_!Tcon!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83065a4-af0e-490a-b150-aa4a5b8742cf_2784x1536.png 848w, https://substackcdn.com/image/fetch/$s_!Tcon!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83065a4-af0e-490a-b150-aa4a5b8742cf_2784x1536.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Tcon!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83065a4-af0e-490a-b150-aa4a5b8742cf_2784x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Tcon!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83065a4-af0e-490a-b150-aa4a5b8742cf_2784x1536.png" width="1456" height="803" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b83065a4-af0e-490a-b150-aa4a5b8742cf_2784x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:803,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8781656,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/183916031?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83065a4-af0e-490a-b150-aa4a5b8742cf_2784x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Tcon!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83065a4-af0e-490a-b150-aa4a5b8742cf_2784x1536.png 424w, https://substackcdn.com/image/fetch/$s_!Tcon!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83065a4-af0e-490a-b150-aa4a5b8742cf_2784x1536.png 848w, 
https://substackcdn.com/image/fetch/$s_!Tcon!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83065a4-af0e-490a-b150-aa4a5b8742cf_2784x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!Tcon!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb83065a4-af0e-490a-b150-aa4a5b8742cf_2784x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>What you&#8217;re actually storing</strong></p><p>It&#8217;s worth being concrete about what the compressed representation 
looks like. Each vector becomes a sequence of 96 bytes:</p><pre><code><code>Vector #0:    [147, 23, 201, 88, 45, ..., 156]  &#8592; 96 code indices
Vector #1:    [92, 178, 34, 88, 212, ..., 41]
Vector #2:    [147, 55, 201, 12, 45, ..., 203]
...
Vector #99,999,999: [84, 23, 19, 241, 178, ..., 92]</code></code></pre><p>That&#8217;s it. No floats. Just bytes pointing into codebooks. The codebooks themselves sit in a small lookup table that fits in CPU cache.</p><h3>How search works on compressed vectors (ADC)</h3><p>Compression is useless if you can&#8217;t search efficiently. The magic of PQ is that you can compute approximate distances <em>directly on the codes</em> without ever reconstructing the original vectors. The technique is called Asymmetric Distance Computation (ADC). </p><p>When a query arrives, you don&#8217;t compress it. You keep all 768 float32 values. The asymmetry is intentional &#8212; you&#8217;re comparing one high-precision query against millions of low-precision database vectors.</p><p>Before scanning any codes, you split the query into the same 96 subvectors and compute the distance from each query subvector to all 256 centroids in the corresponding codebook.</p><pre><code><code>Query subvector q&#8321; = [0.15, -0.33, 0.91, ...]

Distance to centroid 0:   0.234
Distance to centroid 1:   0.891
Distance to centroid 2:   0.156
...
Distance to centroid 255: 0.445

Store as: lookup_table&#8321;[256]</code></code></pre><p>You build 96 such tables, one per subspace. Total work: 96 &#215; 256 = 24,576 distance calculations. This happens once per query, not once per vector.</p><p><strong>Distance = sum of table lookups</strong></p><p>Now the scan is trivial. For each database vector, you look up its code in each table and sum the results.</p><pre><code><code>Database vector codes: [147, 23, 201, ...]

Distance &#8776; lookup_table&#8321;[147] + lookup_table&#8322;[23] + lookup_table&#8323;[201] + ...</code></code></pre><p>That&#8217;s 96 table lookups and 95 additions per vector. No floating-point multiplications. No memory fetches beyond the codes themselves and the cached lookup tables.</p><h4>Why is it fast?</h4><p>Three reasons:</p><ol><li><p><strong>Memory bandwidth</strong>. You&#8217;re reading 96 bytes per vector instead of 3,072. That&#8217;s 32&#215; less data moving from RAM to CPU. At scale, memory bandwidth is often the bottleneck, not compute.</p></li><li><p><strong>Cache efficiency</strong>. The lookup tables (96 &#215; 256 &#215; 4 bytes &#8776; 96 KB) fit comfortably in L2 cache. The codes stream through sequentially, so the only random access is into tables that are already cached.</p></li><li><p><strong>Simple operations</strong>. Table lookups and additions are about as cheap as it gets: no multiplications, no branch mispredictions, no complex instruction sequences. (With float32 tables the additions are floating-point; they only become integer additions if the tables themselves are quantized.)</p></li></ol><p>The result: PQ-based scanning can be <a href="https://engineering.fb.com/2025/05/08/data-infrastructure/accelerating-gpu-indexes-in-faiss-with-nvidia-cuvs/">10-30&#215; faster</a> than brute-force float32 distance calculations, depending on hardware. 
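</p><p>To make the mechanics concrete, here is a minimal pure-Python sketch of PQ encoding plus ADC scoring. Treat it as an illustration under loud assumptions, not as how any particular vector database implements it: the helper names (<em>train_codebooks</em>, <em>encode</em>, <em>build_tables</em>, <em>adc</em>) are hypothetical, distances are squared Euclidean, the sizes are toy values (16 dims, m=4, k=8 rather than 768, 96, 256), and the codebooks are random samples of training subvectors standing in for real k-means.</p>

```python
import random


def split(vec, m):
    """Split a vector into m equal-length subvectors."""
    d = len(vec) // m
    return [vec[i * d:(i + 1) * d] for i in range(m)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_codebooks(train, m, k, rng):
    """One codebook per subspace.

    Simplification (assumption): centroids are k randomly sampled training
    subvectors rather than the result of k-means clustering.
    """
    books = []
    for j in range(m):
        subs = [split(v, m)[j] for v in train]
        books.append(rng.sample(subs, k))
    return books


def encode(vec, books):
    """Replace each subvector with the index of its nearest centroid."""
    return [min(range(len(book)), key=lambda c: sq_dist(sub, book[c]))
            for sub, book in zip(split(vec, len(books)), books)]


def build_tables(query, books):
    """Once per query: distance from each query subvector to every centroid."""
    return [[sq_dist(sub, centroid) for centroid in book]
            for sub, book in zip(split(query, len(books)), books)]


def adc(codes, tables):
    """Approximate distance: one table lookup per subspace, summed."""
    return sum(table[code] for table, code in zip(tables, codes))


rng = random.Random(0)
dim, m, k = 16, 4, 8  # toy sizes; the article uses 768, 96, 256
data = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]

books = train_codebooks(data, m, k, rng)
codes = [encode(v, books) for v in data]      # m small integers per vector

query = [rng.gauss(0, 1) for _ in range(dim)]  # the query stays uncompressed
tables = build_tables(query, books)            # m tables of k distances
scores = [adc(c, tables) for c in codes]       # the scan: lookups and adds

best = min(range(len(data)), key=lambda i: scores[i])
print("best id:", best, "approx squared distance:", round(scores[best], 3))
```

<p>The sum that <em>adc</em> returns equals the exact squared distance between the query and the vector rebuilt from its centroids, so the only error in the ranking comes from how well the codebooks fit the data.</p><p>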
Combined with an IVF index that limits which vectors you scan at all, you get practical query times even at 100M scale.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w_xp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c081aec-4662-46f3-a12a-bae2e3719b17_2784x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w_xp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c081aec-4662-46f3-a12a-bae2e3719b17_2784x1536.png 424w, https://substackcdn.com/image/fetch/$s_!w_xp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c081aec-4662-46f3-a12a-bae2e3719b17_2784x1536.png 848w, https://substackcdn.com/image/fetch/$s_!w_xp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c081aec-4662-46f3-a12a-bae2e3719b17_2784x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!w_xp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c081aec-4662-46f3-a12a-bae2e3719b17_2784x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w_xp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c081aec-4662-46f3-a12a-bae2e3719b17_2784x1536.png" width="1456" height="803" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4c081aec-4662-46f3-a12a-bae2e3719b17_2784x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:803,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6581043,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/183916031?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c081aec-4662-46f3-a12a-bae2e3719b17_2784x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!w_xp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c081aec-4662-46f3-a12a-bae2e3719b17_2784x1536.png 424w, https://substackcdn.com/image/fetch/$s_!w_xp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c081aec-4662-46f3-a12a-bae2e3719b17_2784x1536.png 848w, https://substackcdn.com/image/fetch/$s_!w_xp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c081aec-4662-46f3-a12a-bae2e3719b17_2784x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!w_xp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c081aec-4662-46f3-a12a-bae2e3719b17_2784x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4>The tradeoff</h4><p>None of this is free. PQ introduces quantization error &#8212; the distance you compute is an approximation of the true distance. Some vectors will be slightly misordered in your rankings.</p><blockquote><p>The key point: PQ gives you a knob to turn. More bytes per vector (higher <em>m</em>) means less error but more memory. Fewer bytes means more compression but more error. You choose the tradeoff that fits your constraints.</p></blockquote><h3>Index Choice Determines the Rest of the RAM Story</h3><p>Compressing vectors from 307 GB to 10 GB solves the dominant term. 
But you still can&#8217;t scan 100 million vectors on every query; even with PQ&#8217;s fast distance calculations, that&#8217;s too slow for production latencies.</p><ul><li><p><strong>HNSW &#8212; fast but memory heavy</strong></p></li></ul><p><a href="https://mlpills.substack.com/p/issue-117-scaling-vector-search-hnsw?utm_source=publication-search">HNSW</a> builds a graph where each vector connects to its approximate nearest neighbors. It&#8217;s the performance king: sub-millisecond queries, 95%+ recall, minimal tuning required. The cost is memory. Each vector stores neighbor lists across multiple graph layers. With typical parameters (M=32), you&#8217;re adding roughly 250-300 bytes per vector just for graph edges: the bottom layer alone holds up to 2M = 64 neighbor IDs of 4 bytes each. At 100M vectors, that&#8217;s <strong>25+ GB for the index alone</strong> on top of your vector storage. If you&#8217;ve compressed vectors to 10 GB with PQ, then add 25 GB for HNSW, you&#8217;re at 35 GB before IDs or metadata. Workable on a 64 GB machine, but tight.</p><ul><li><p><strong>IVF-PQ &#8212; the scale play</strong></p></li></ul><p>IVF takes a different approach: partition the space into clusters, then search only the relevant clusters. At index time, k-means creates <em>nlist</em> cluster centroids (typically 4,096-16,384). Each vector gets assigned to its nearest cluster. At query time, you find the <em>nprobe</em> closest clusters to your query and scan only those posting lists using PQ distance calculations. The memory overhead is minimal, just the centroids and some bookkeeping for posting lists. Call it 200-300 MB total at this scale; the centroids themselves don&#8217;t grow with the database.</p><h4>Why IVF-PQ wins at scale</h4><p>The math makes the choice obvious:</p><ul><li><p><strong>PQ + HNSW:</strong> 10 GB vectors + 25 GB graph = 35 GB</p></li><li><p><strong>PQ + IVF:</strong> 10 GB vectors + 0.3 GB index = 10.3 GB</p></li></ul><p>IVF-PQ uses <strong>3&#215; less memory</strong> than HNSW for the same compressed vectors. 
That&#8217;s the difference between &#8220;fits on a 32 GB machine&#8221; and &#8220;needs 64 GB minimum.&#8221;</p><p>The tradeoff is recall and tuning complexity. IVF-PQ typically achieves 80-90% recall@10 versus HNSW&#8217;s 95%+, and it requires more parameter tuning (nlist, nprobe) to get there. But for memory-constrained deployments at 100M+ scale, that tradeoff is almost always worth it.</p><h3>The Two-Tier Storage Pattern (Hot vs Cold)</h3><p>Here&#8217;s a secret about production vector databases: they don&#8217;t keep everything in RAM. They <a href="https://proceedings.neurips.cc/paper_files/paper/2019/file/09853c7fb1d3f8ee67a61b6bf4a7f8e6-Paper.pdf">keep</a> <em>just enough</em> in RAM to identify candidates, then fetch the rest on demand. This hot/cold split is what makes &#8220;100M on one machine&#8221; practical, not just theoretically possible.</p><ul><li><p><strong>Hot tier &#8212; what stays in RAM</strong></p></li></ul><p>The hot tier contains everything needed to answer the question: &#8220;which vectors are closest to this query?&#8221; That means PQ codes (~10 GB for 100M vectors), IVF cluster structures (~300 MB), and vector IDs to map results back to documents (~800 MB). Add a few bytes of metadata per vector for basic filtering and call it 12-15 GB total. This is your working set. It needs to be in RAM because you&#8217;re scanning millions of PQ codes per query and you can&#8217;t afford disk latency in that loop.</p><ul><li><p><strong>Cold tier &#8212; what lives on SSD</strong></p></li></ul><p>Everything else goes to disk: the original float32 vectors, if you want an optional refinement stage; full document text or chunk content for <a href="https://mlpills.substack.com/p/issue-82-introduction-to-agentic-69d?utm_source=publication-search">RAG</a> retrieval; extended metadata such as timestamps, permissions, tags, and whatever else your application needs; and audit logs, versioning information, or anything else that doesn&#8217;t need sub-millisecond access. 
SSDs are cheap and fast enough. Reading 100 vectors &#215; 3 KB each = 300 KB from NVMe takes under a millisecond. That&#8217;s negligible compared to network round-trips to your LLM.</p><h4>The pipeline in practice</h4><ol><li><p><strong>Hot path (RAM):</strong> IVF identifies relevant clusters, PQ scores candidates, returns top 1,000 vector IDs</p></li><li><p><strong>Cold path (SSD):</strong> Fetch original vectors for top 100, recompute exact distances, rerank</p></li><li><p><strong>Retrieval (SSD):</strong> Load actual document chunks for top 20 results</p></li><li><p><strong>Downstream:</strong> Send chunks to cross-encoder or <a href="https://mlpills.substack.com/p/issue-110-llm-workflow-patterns?utm_source=publication-search">LLM</a></p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NSzR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f5a68d2-88fb-441f-8814-ca754199bd14_2784x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NSzR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f5a68d2-88fb-441f-8814-ca754199bd14_2784x1536.png 424w, https://substackcdn.com/image/fetch/$s_!NSzR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f5a68d2-88fb-441f-8814-ca754199bd14_2784x1536.png 848w, https://substackcdn.com/image/fetch/$s_!NSzR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f5a68d2-88fb-441f-8814-ca754199bd14_2784x1536.png 1272w, 
https://substackcdn.com/image/fetch/$s_!NSzR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f5a68d2-88fb-441f-8814-ca754199bd14_2784x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NSzR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f5a68d2-88fb-441f-8814-ca754199bd14_2784x1536.png" width="1456" height="803" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4f5a68d2-88fb-441f-8814-ca754199bd14_2784x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:803,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8667372,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/183916031?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f5a68d2-88fb-441f-8814-ca754199bd14_2784x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NSzR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f5a68d2-88fb-441f-8814-ca754199bd14_2784x1536.png 424w, https://substackcdn.com/image/fetch/$s_!NSzR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f5a68d2-88fb-441f-8814-ca754199bd14_2784x1536.png 848w, 
https://substackcdn.com/image/fetch/$s_!NSzR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f5a68d2-88fb-441f-8814-ca754199bd14_2784x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!NSzR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f5a68d2-88fb-441f-8814-ca754199bd14_2784x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4>Why this works for RAG</h4><p>RAG applications don&#8217;t actually need vectors at the end of the pipeline. They need text. 
Your user asks a question. You embed it, search vectors, and get back document IDs. Then you load those documents and feed them to an LLM. The vectors were just an intermediate step to find relevant content.</p><p>This means the cold tier isn&#8217;t optional overhead; it&#8217;s where your actual content lives. The hot tier is a compressed index <em>into</em> that content. Keeping them separate is natural, not a compromise.</p><p><strong>The budget reality</strong></p><p>With hot/cold separation, your RAM requirement drops dramatically:</p><ul><li><p><strong>PQ codes (RAM):</strong> 9.6 GB</p></li><li><p><strong>IVF structures (RAM):</strong> 0.3 GB</p></li><li><p><strong>Vector IDs (RAM):</strong> 0.8 GB</p></li><li><p><strong>Minimal metadata (RAM):</strong> 1-2 GB</p></li><li><p><strong>Hot tier total (RAM): ~12 GB</strong></p></li><li><p><strong>Original vectors (SSD):</strong> 307 GB</p></li><li><p><strong>Document chunks (SSD):</strong> variable</p></li><li><p><strong>Extended metadata (SSD):</strong> variable</p></li></ul><p>A 32 GB machine handles the hot tier comfortably. A 2 TB SSD handles everything else. Total hardware cost: a few hundred dollars, not tens of thousands. This is why &#8220;100M on one machine&#8221; is feasible: not because we solved an impossible compression problem, but because we only keep the index hot and let everything else stay cold.</p><h3>Worked Memory Budget (The Proof)</h3><h4>The setup</h4><ul><li><p>100 million vectors</p></li><li><p>768 dimensions (a common embedding size)</p></li><li><p>PQ with m=96 subspaces, k=256 centroids</p></li><li><p>IVF with nlist=4,096 clusters</p></li></ul><p>Component by component:</p><ul><li><p><strong>PQ codes</strong> &#8212; the compressed vectors themselves. Each vector becomes 96 bytes (one byte per subspace). That&#8217;s 100,000,000 &#215; 96 = 9,600,000,000 bytes. <strong>9.6 GB.</strong></p></li><li><p><strong>Codebooks</strong> &#8212; the lookup tables for reconstruction and distance computation. You have 96 codebooks, each with 256 centroids of 8 dimensions stored as float32. That&#8217;s 96 &#215; 256 &#215; 8 &#215; 4 = 786,432 bytes. 
<strong>Under 1 MB.</strong> This is constant regardless of database size.</p></li><li><p><strong>Vector IDs</strong> &#8212; mapping search results back to documents. Using 64-bit integers for safety, that&#8217;s 100,000,000 &#215; 8 = 800,000,000 bytes. <strong>0.8 GB.</strong></p></li><li><p><strong>IVF structures</strong> &#8212; cluster centroids plus posting list overhead. The centroids themselves are 4,096 &#215; 768 &#215; 4 = 12,582,912 bytes (12.6 MB). Posting list bookkeeping (offsets, lengths) adds another few hundred MB. Call it <strong>0.5 GB</strong> total.</p></li><li><p><strong>Minimal metadata</strong> &#8212; basic fields you might filter on. Say 16 bytes per vector for timestamps, flags, and a category ID. That&#8217;s 100,000,000 &#215; 16 = 1,600,000,000 bytes. <strong>1.6 GB.</strong></p></li><li><p><strong>Allocator overhead</strong> &#8212; fragmentation, alignment, bookkeeping. Estimate 10% on top of everything. <strong>~1.2 GB.</strong></p></li></ul><h4>The total</h4><ul><li><p><strong>PQ codes:</strong> 9.6 GB</p></li><li><p><strong>Codebooks:</strong> &lt; 1 MB</p></li><li><p><strong>Vector IDs:</strong> 0.8 GB</p></li><li><p><strong>IVF structures:</strong> 0.5 GB</p></li><li><p><strong>Minimal metadata:</strong> 1.6 GB</p></li><li><p><strong>Allocator overhead:</strong> ~1.2 GB</p></li><li><p><strong>Hot tier total: ~13.7 GB</strong></p></li></ul><p>Round up for safety: <strong>15 GB</strong> for a fully operational 100M vector index.</p><h4>The comparison</h4><p>Raw float32 vectors alone would cost 307 GB. Our compressed, indexed system fits in 15 GB. 
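</p><p>The budget is easy to re-derive. The sketch below simply replays the arithmetic from the breakdown above in Python, using decimal gigabytes. The 0.5 GB IVF figure and the 10% allocator overhead are the estimates quoted above, not measurements, and the variable names are only illustrative.</p>

```python
N = 100_000_000            # vectors
DIM, M, K = 768, 96, 256   # dimensions, PQ subspaces, centroids per codebook
NLIST = 4_096              # IVF clusters
GB = 1e9                   # decimal gigabytes, as used in the breakdown

raw = N * DIM * 4                  # float32 baseline, in bytes
centroids = NLIST * DIM * 4        # centroid part of the IVF estimate

hot = {
    "PQ codes": N * M,                     # one byte per subspace
    "Codebooks": M * K * (DIM // M) * 4,   # float32 centroids
    "Vector IDs": N * 8,                   # 64-bit integers
    "IVF structures": 0.5 * GB,            # estimate: centroids + posting lists
    "Minimal metadata": N * 16,            # assumed 16 bytes per vector
}
hot["Allocator overhead"] = 0.10 * sum(hot.values())  # estimated 10%
total = sum(hot.values())

for name, size in hot.items():
    print(f"{name:20s} {size / GB:8.3f} GB")
print(f"{'Hot tier total':20s} {total / GB:8.3f} GB")
print(f"IVF centroids alone: {centroids / 1e6:.1f} MB")
print(f"Raw float32: {raw / GB:.1f} GB, "
      f"reduction against a 15 GB budget: {raw / (15 * GB):.1f}x")
```

<p>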
That&#8217;s a <strong>20&#215; reduction</strong>: from &#8220;needs a server with half a terabyte of RAM&#8221; to &#8220;fits on a laptop.&#8221;</p><ul><li><p>A 32 GB machine runs this index comfortably with room for the OS, file caches, and your application code.</p></li><li><p>A 64 GB machine gives you headroom for growth, more metadata, or a hybrid HNSW layer for frequently accessed vectors.</p></li><li><p>A 128 GB machine is overkill for vectors alone, but gives you space to keep original float32 vectors in RAM for refinement without hitting disk.</p></li></ul><p>The cold tier (the original vectors, full documents, and extended metadata) lives on a 1-2 TB SSD. NVMe drives handle this for a few hundred dollars.</p><h3>Reproducible Demo</h3><p><a href="https://github.com/NinoRisteski/vectordb-100m">This repository</a> builds everything we&#8217;ve discussed (PQ compression, IVF indexing, recall measurement) and lets you confirm the numbers yourself. Run it at 1M vectors on your laptop, extrapolate to 100M, and see that the math holds.</p><ul><li><p><strong>What the demo measures</strong></p></li></ul><p>The script generates synthetic embeddings at feasible scale (1M vectors by default), builds both an exact index and a compressed IVF-PQ index, then compares them head-to-head:</p><table><thead><tr><th>Metric</th><th>Exact (FlatL2)</th><th>IVF-PQ</th></tr></thead><tbody><tr><td>Index size</td><td>2.86 GB</td><td>0.11 GB</td></tr><tr><td>Search time (10K queries)</td><td>12.0 s</td><td>1.4 s</td></tr><tr><td>Recall@10</td><td>100%</td><td>81.4%</td></tr></tbody></table><p>The compression ratio is 26&#215; on the actual index files. The speedup is about 8.5&#215;. The recall lands at the lower end of the 80-90% range we&#8217;ve been claiming.</p><ul><li><p><strong>Extrapolation to 100M</strong></p></li></ul><p>The script scales these measurements to show what 100M vectors would require. Its sizes are consistent with binary units (GiB), which is why raw vectors show up as 286 GB here rather than the 307 GB (decimal) quoted earlier:</p><table><thead><tr><th>Component</th><th>Size</th></tr></thead><tbody><tr><td>Raw float32 vectors</td><td>286 GB</td></tr><tr><td>PQ codes only</td><td>8.9 GB</td></tr><tr><td>Full IVF-PQ index</td><td>~10.9 GB</td></tr></tbody></table><p>
<strong>Fits in 64 GB RAM?</strong> <strong>YES</strong></p><h3>Three things to remember</h3><ol><li><p><strong>&#8220;100M vectors on one machine&#8221; is a systems design outcome, not a single algorithm.</strong> It&#8217;s the combination of compression, indexing, and tiered storage, each solving a different part of the problem. Skip any piece and the math falls apart.</p></li><li><p><strong>The winning recipe: partition &#8594; compress &#8594; shortlist &#8594; refine.</strong> IVF partitions the search space so you don&#8217;t scan everything. PQ compresses vectors so they fit in RAM. ANN search shortlists candidates cheaply. Exact refinement recovers precision where it matters. This pipeline is how most production systems work at scale.</p></li><li><p><strong>PQ is the key that unlocks everything else.</strong> Without Product Quantization, you&#8217;re stuck at 307 GB with no path forward. With it, vectors drop to about 10 GB and suddenly the rest of the system (indexing, metadata, hot/cold storage) becomes tractable. Compress the dominant term first; everything else follows.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[RW #8 - Turning Podcasts Into Knowledge Graphs]]></title><description><![CDATA[Picture this: You&#8217;ve just finished listening to a fascinating 2-hour podcast. The expert dropped dozens of insights, research findings, and connections between ideas. 
But now, trying to recall how concept A relates to concept B, or what specific recommendations were made, feels like searching for a needle in a haystack. What if you could instantly see all these connections laid out visually? What if you could query this knowledge like a database? Better yet, what if you could combine insights from hundreds of podcasts into a single, searchable knowledge network?]]></description><link>https://mlpills.substack.com/p/rw-8-turning-podcasts-into-knowledge</link><guid isPermaLink="false">https://mlpills.substack.com/p/rw-8-turning-podcasts-into-knowledge</guid><dc:creator><![CDATA[David Andrés]]></dc:creator><pubDate>Sun, 28 Sep 2025 07:02:10 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c37e1cfb-9db4-4052-90fd-f31574d39a66_1026x707.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1><strong>&#128138; Pill of the week</strong></h1><p>Picture this: You&#8217;ve just finished listening to a fascinating 2-hour podcast. The expert dropped dozens of insights, research findings, and connections between ideas. But now, trying to recall how concept A relates to concept B, or what specific recommendations were made, feels like searching for a needle in a haystack.</p><p>What if you could instantly see all these connections laid out visually? What if you could query this knowledge like a database? Better yet, what if you could combine insights from hundreds of podcasts into a single, searchable knowledge network?</p><p>That&#8217;s exactly what I built: an AI-powered system that transforms podcast transcripts into living knowledge graphs. 
In one recent test with a medical podcast, it extracted 31 entities connected by 29 precise relationships&#8212;and the patterns it revealed were stunning.</p><p>But first, <strong>what is a Knowledge Graph?</strong></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;ad2c1442-cad9-4687-aff7-c9be69991b59&quot;,&quot;caption&quot;:&quot;&#128138; Pill of the Week&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Issue #102 - Knowledge Graphs to make RAG smarter&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:38707812,&quot;name&quot;:&quot;David Andr&#233;s&quot;,&quot;bio&quot;:&quot;&#128188; Data Scientist &#8226; &#128013; Python enthusiast&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db6423b2-36bc-440c-be7d-b54be5bad1b0_1447x1448.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-07-26T11:45:05.841Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5319ab73-1b76-48b5-a2f4-0136e1630992_669x420.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://mlpills.substack.com/p/issue-102-knowledge-graphs-to-make&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:169295542,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:9,&quot;comment_count&quot;:0,&quot;publication_id&quot;:1354140,&quot;publication_name&quot;:&quot;Machine Learning 
Pills&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!YODk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dba4244-97d2-48f0-a2bb-b01c7ea74212_118x118.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h2>The Problem with Podcast Knowledge</h2><p>Podcasts have become our modern university. From Joe Rogan to Lex Fridman, from health experts to tech visionaries, thousands of hours of expertise flow through our earbuds daily. Yet this knowledge remains frustratingly ephemeral. We consume it linearly, remember fragments, and lose the connections.</p><p>I recently faced this challenge myself. After listening to several deep-dive podcasts on complex topics&#8212;whether medical, technological, or philosophical&#8212;I found myself struggling to remember how all the pieces fit together. Taking notes felt inadequate&#8212;I needed something more powerful. I needed a knowledge graph that could capture not just the facts, but the intricate web of relationships between ideas.</p><h2>Enter Neo4j: The Graph Database Revolution</h2><p>Before we dive into code, let&#8217;s talk about Neo4j, the technology that makes this magic possible.</p><p>Neo4j isn&#8217;t your typical database. While traditional databases store information in rigid tables (think Excel spreadsheets), Neo4j stores data as a network of connected nodes&#8212;just like how our brains actually work. In Neo4j:</p><ul><li><p><strong>Nodes</strong> represent entities (people, concepts, diseases, treatments)</p></li><li><p><strong>Relationships</strong> connect these nodes (CAUSES, TREATS, PREVENTS)</p></li><li><p><strong>Properties</strong> add details to both nodes and relationships</p></li></ul><p>Imagine Wikipedia, but where every link between articles is labeled with the type of connection. 
That&#8217;s Neo4j in a nutshell.</p><h3>Why Neo4j for Knowledge Graphs?</h3><p>I chose Neo4j because it speaks the language of relationships naturally. When a podcast expert explains complex relationships&#8212;whether it&#8217;s &#8220;X causes Y which leads to Z&#8221; or &#8220;A enables B but conflicts with C&#8221;&#8212;Neo4j can represent these exactly as stated:</p><pre><code><code>(Concept A)-[:CAUSES]-&gt;(Concept B)-[:LEADS_TO]-&gt;(Concept C)</code></code></pre><p>No complex joins, no foreign keys, just intuitive connections that mirror how we think.</p><h2>Neo4j Aura: Your Graph in the Cloud</h2><p>For this project, I&#8217;m using Neo4j Aura, the cloud-hosted version of Neo4j. It&#8217;s like having a graph database without the hassle of managing servers. Setting it up takes minutes:</p><ol><li><p>Sign up at <a href="https://neo4j.com/aura">neo4j.com/aura</a></p></li><li><p>Create a free instance (yes, they have a generous free tier)</p></li><li><p>Save your connection credentials</p></li><li><p>Access your database through the beautiful Aura Console</p></li></ol><p>The Aura Console includes Neo4j Browser, a powerful visualization tool where you can see your knowledge graph come alive. More on this later.</p><h2>The Journey: From Audio to Insight</h2><p>Let me walk you through how I built this system, sharing the actual code and the thinking behind each step.</p><h3>Step 1: Cleaning the Mess</h3><p>Podcast transcripts are messy. They&#8217;re littered with timestamps and speaker labels that look like this:</p><pre><code><code>6 (3m 55s): So when we talk about insulin resistance...
7 (3m 58s): Exactly, and that&#8217;s why visceral fat...</code></code></pre><p>Our first job is to clean this up. Here&#8217;s the function I wrote:</p><pre><code><code>import re

def read_and_clean_txt(file_path: str) -&gt; str:
    """
    Reads a text file and removes speaker/timestamp labels like '6 (3m 55s):'.
    """
    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()

    # Match lines like "6 (3m 55s):" or "6 (1h 0m 32s):"
    pattern = r'^\d+ \((?:\d+h )?\d+m \d+s\):\s*'
    cleaned_text = re.sub(pattern, '', text, flags=re.MULTILINE)
    return cleaned_text</code></code></pre><p>The pattern anchors at the start of each line (<code>re.MULTILINE</code> makes <code>^</code> match after every newline), matches the numeric speaker label and the parenthesized timestamp with its optional hours part, and strips them, leaving only the conversational content. Why does this matter? Because when we feed this text to our AI, we want it focusing on &#8220;insulin resistance&#8221; and &#8220;heart disease,&#8221; not getting confused by &#8220;3m 55s.&#8221;</p><h3>Step 2: Connecting to Our Digital Brain</h3><p>Next, I set up connections to both OpenAI (our AI brain) and Neo4j Aura (our graph database):</p><pre><code><code>import os

# Your credentials (get these from OpenAI and Neo4j Aura)
os.environ["OPENAI_API_KEY"] = "your-openai-key"
os.environ["NEO4J_URI"] = "neo4j+s://your-instance.databases.neo4j.io"
os.environ["NEO4J_USERNAME"] = "neo4j"
os.environ["NEO4J_PASSWORD"] = "your-password"

from langchain_neo4j import Neo4jGraph
graph = Neo4jGraph(refresh_schema=True)</code></code></pre><p>Notice the URI format for Aura: <code>neo4j+s://</code> indicates a TLS-encrypted connection. Your Aura instance provides this connection string; it&#8217;s like the address of your personal knowledge vault in the cloud.</p><h3>Step 3: The AI Magic Happens</h3><p>This is where things get exciting. We&#8217;re going to use GPT-4o to read our transcript and automatically identify entities and relationships. It&#8217;s like having a brilliant research assistant who can read a document and instantly create a mind map:</p><pre><code><code>from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI
from langchain_core.documents import Document

# Initialize GPT-4o
llm = ChatOpenAI(model_name="gpt-4o")

# Create our AI transformer
llm_transformer = LLMGraphTransformer(llm=llm)

# Load our cleaned transcript
text = read_and_clean_txt("podcast_transcript.txt")
documents = [Document(page_content=text)]

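# The magic moment - AI extracts knowledge structure.
# Top-level await works in Jupyter/IPython; in a plain script, wrap this
# call in asyncio.run(...) or use the synchronous convert_to_graph_documents.
graph_documents = await llm_transformer.aconvert_to_graph_documents(documents)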

# Let&#8217;s see what it found!
print(f"Nodes:{graph_documents[0].nodes}")
print(f"Relationships:{graph_documents[0].relationships}")</code></code></pre><p>What&#8217;s happening under the hood here is remarkable. The AI is:</p><ol><li><p>Reading the entire transcript</p></li><li><p>Identifying key concepts and entities</p></li><li><p>Understanding how these concepts relate</p></li><li><p>Structuring this into a graph format</p></li></ol><p>To show you the power of this approach, here&#8217;s what the AI extracted from a recent medical podcast about heart health and nutrition:</p><p><strong>31 Entities Identified across various categories:</strong></p><ul><li><p><strong>Diseases</strong>: Heart Disease, Cardiomyopathy, Coronary Artery Disease, Diabetes</p></li><li><p><strong>Biological Substances</strong>: Insulin, Cholesterol, LDL, HDL, Omega-3, Ketones</p></li><li><p><strong>Medical Conditions</strong>: Insulin Resistance, Visceral Fat, Fatty Liver, Leaky Gut, Inflammation</p></li><li><p><strong>Interventions</strong>: Fasting, Exercise, Vitamin D3, Vitamin K2, Calcium Supplements</p></li></ul><p><strong>29 Relationships Discovered</strong>, revealing complex connections like:</p><ul><li><p>(Visceral Fat)-[:CAUSES]-&gt;(Heart Disease)</p></li><li><p>(Fasting)-[:ALLEVIATES]-&gt;(Insulin Resistance)</p></li><li><p>(Omega-3)-[:REDUCES]-&gt;(Inflammation)</p></li><li><p>(Calcium Supplements)-[:INCREASES]-&gt;(Cardiovascular Events)</p></li></ul><p>Notice that last one: the AI picked up a counterintuitive finding about calcium supplements that many people might miss in a linear conversation. 
This is the power of AI-driven knowledge extraction: it captures both obvious and subtle relationships.</p><p>For a technology podcast, you might see entities like &#8220;Machine Learning,&#8221; &#8220;Neural Networks,&#8221; &#8220;GPU Computing&#8221; with relationships like &#8220;ENABLES,&#8221; &#8220;OPTIMIZES,&#8221; or &#8220;REQUIRES.&#8221; For a business podcast, expect &#8220;Market Strategy,&#8221; &#8220;Customer Acquisition,&#8221; &#8220;Revenue Models&#8221; connected by &#8220;DRIVES,&#8221; &#8220;IMPACTS,&#8221; or &#8220;DEPENDS_ON.&#8221;</p><h3>Step 4: Storing Our Knowledge in Neo4j</h3><p>Now we push this extracted knowledge into Neo4j Aura:</p><pre><code><code># Clear any existing data (careful with this in production!)
graph.query("MATCH (n) DETACH DELETE n")

# Store our knowledge graph
graph.add_graph_documents(graph_documents, baseEntityLabel=True)</code></code></pre><p>That simple <code>add_graph_documents</code> command creates potentially hundreds of nodes and relationships in your Neo4j database. It&#8217;s like watching a constellation of knowledge form instantly.</p><h2>Visualizing in Neo4j Aura</h2><p>Here&#8217;s the moment of truth. Open your Neo4j Aura Console and navigate to Neo4j Browser. This is where your knowledge graph transforms from data into insight.</p><h3>Exploring Your Graph in Aura</h3><p>In the Neo4j Browser query bar, you can explore your knowledge graph with Cypher queries. Here are some powerful patterns:</p><p><strong>See everything:</strong></p><pre><code><code>MATCH (n)-[r]-&gt;(m)
RETURN n, r, m;</code></code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q9n2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3095523-fc5d-48d8-a83e-c51735fd970e_1253x515.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q9n2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3095523-fc5d-48d8-a83e-c51735fd970e_1253x515.png 424w, https://substackcdn.com/image/fetch/$s_!Q9n2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3095523-fc5d-48d8-a83e-c51735fd970e_1253x515.png 848w, https://substackcdn.com/image/fetch/$s_!Q9n2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3095523-fc5d-48d8-a83e-c51735fd970e_1253x515.png 1272w, https://substackcdn.com/image/fetch/$s_!Q9n2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3095523-fc5d-48d8-a83e-c51735fd970e_1253x515.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q9n2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3095523-fc5d-48d8-a83e-c51735fd970e_1253x515.png" width="1253" height="515" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3095523-fc5d-48d8-a83e-c51735fd970e_1253x515.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:515,&quot;width&quot;:1253,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:81644,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/174707914?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3095523-fc5d-48d8-a83e-c51735fd970e_1253x515.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Q9n2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3095523-fc5d-48d8-a83e-c51735fd970e_1253x515.png 424w, https://substackcdn.com/image/fetch/$s_!Q9n2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3095523-fc5d-48d8-a83e-c51735fd970e_1253x515.png 848w, https://substackcdn.com/image/fetch/$s_!Q9n2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3095523-fc5d-48d8-a83e-c51735fd970e_1253x515.png 1272w, https://substackcdn.com/image/fetch/$s_!Q9n2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3095523-fc5d-48d8-a83e-c51735fd970e_1253x515.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Trace causal pathways:</strong></p>
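<p>As a rough, in-memory illustration of what tracing a pathway means, the sketch below walks directed edges in plain Python rather than Cypher. The edge list reuses relationships quoted earlier in this post (the abstract Concept A/B/C pattern and two of the medical examples); the <code>trace_paths</code> and <code>format_path</code> helpers and the <code>max_hops</code> limit are assumptions made for illustration, not part of Neo4j or of the original pipeline.</p>

```python
from collections import defaultdict

# Edges quoted in the post: "Concept A/B/C" come from its abstract example
# pattern, the other two from the medical-podcast relationship list.
EDGES = [
    ("Concept A", "CAUSES", "Concept B"),
    ("Concept B", "LEADS_TO", "Concept C"),
    ("Visceral Fat", "CAUSES", "Heart Disease"),
    ("Omega-3", "REDUCES", "Inflammation"),
]


def trace_paths(edges, start, max_hops=3):
    """Return every directed path of 1..max_hops edges starting at `start`.

    Hypothetical helper: each path is a flat list that alternates
    node, relationship type, node, ...
    """
    outgoing = defaultdict(list)
    for source, rel_type, target in edges:
        outgoing[source].append((rel_type, target))

    paths = []
    stack = [(start, [start])]  # (current node, path walked so far)
    while stack:
        node, path = stack.pop()
        hops = len(path) // 2  # two list entries are added per hop
        if hops == max_hops:
            continue
        for rel_type, target in outgoing[node]:
            if target in path[::2]:  # even positions hold nodes; skip cycles
                continue
            new_path = path + [rel_type, target]
            paths.append(new_path)
            stack.append((target, new_path))
    return paths


def format_path(path):
    """Render a flat path list in the arrow notation used in this post."""
    parts = [f"({path[0]})"]
    for rel_type, target in zip(path[1::2], path[2::2]):
        parts.append(f"-[:{rel_type}]->({target})")
    return "".join(parts)


for found in trace_paths(EDGES, "Concept A"):
    print(format_path(found))
```

<p>In Neo4j itself the same idea is usually expressed as a variable-length pattern match; the hop limit here plays the role of the upper bound on path length.</p>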
      <p>
          <a href="https://mlpills.substack.com/p/rw-8-turning-podcasts-into-knowledge">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[RW #7 - When to use Rules, ML or LLMs?]]></title><description><![CDATA[Deciding between a simple, rule-based system and a sophisticated machine learning (ML) model is a critical choice in software development. While it's tempting to jump to the latest AI, often a few well-written if-then statements are more effective. Here&#8217;s how to know when you truly need to make the leap to ML.]]></description><link>https://mlpills.substack.com/p/rw-7-when-to-use-rules-ml-or-llms</link><guid isPermaLink="false">https://mlpills.substack.com/p/rw-7-when-to-use-rules-ml-or-llms</guid><dc:creator><![CDATA[David Andrés]]></dc:creator><pubDate>Sat, 20 Sep 2025 08:20:55 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ce173065-b745-430b-885b-160ca54e1d66_1026x707.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1><strong>&#128138; Pill of the week</strong></h1><p>Deciding between a simple, rule-based system and a sophisticated machine learning (ML) model is a critical choice in software development. While it's tempting to jump to the latest AI, often a few well-written <code>if-then</code> statements are more effective. Here&#8217;s how to know when you truly need to make the leap to ML.</p><h2>The Power and Place of Simple Rules</h2><p>A <strong>rule-based system</strong> operates on a set of handcrafted, deterministic logic. If a specific condition is met, a specific action is taken. Think of a simple email filter: </p><p><code>IF subject CONTAINS "you've won" THEN move to Spam</code>.</p><p>You should stick with simple rules when your problem has:</p><ul><li><p><strong>High Explainability:</strong> The logic is transparent. 
You can trace exactly why the system made a particular decision. This is crucial for applications like tax calculations or regulatory compliance.</p></li><li><p><strong>A Manageable Number of Conditions:</strong> The logic can be described in a few dozen or even a few hundred <code>if-else</code> statements without becoming a tangled mess.</p></li><li><p><strong>Deterministic and Stable Environment:</strong> The underlying patterns don't change. Password strength requirements, for example, are set by policy and don't evolve on their own.</p></li><li><p><strong>No Need for Nuance:</strong> The inputs are straightforward and unambiguous. A transaction is either over $10,000 or it isn't; there's no gray area.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!C47s!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cef249d-d491-41c6-a97f-589968b45ba4_915x281.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C47s!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cef249d-d491-41c6-a97f-589968b45ba4_915x281.png 424w, https://substackcdn.com/image/fetch/$s_!C47s!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cef249d-d491-41c6-a97f-589968b45ba4_915x281.png 848w, https://substackcdn.com/image/fetch/$s_!C47s!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cef249d-d491-41c6-a97f-589968b45ba4_915x281.png 1272w, 
https://substackcdn.com/image/fetch/$s_!C47s!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cef249d-d491-41c6-a97f-589968b45ba4_915x281.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C47s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cef249d-d491-41c6-a97f-589968b45ba4_915x281.png" width="915" height="281" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6cef249d-d491-41c6-a97f-589968b45ba4_915x281.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:281,&quot;width&quot;:915,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:67118,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/174081629?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cef249d-d491-41c6-a97f-589968b45ba4_915x281.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C47s!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cef249d-d491-41c6-a97f-589968b45ba4_915x281.png 424w, https://substackcdn.com/image/fetch/$s_!C47s!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cef249d-d491-41c6-a97f-589968b45ba4_915x281.png 848w, https://substackcdn.com/image/fetch/$s_!C47s!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cef249d-d491-41c6-a97f-589968b45ba4_915x281.png 1272w, 
https://substackcdn.com/image/fetch/$s_!C47s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6cef249d-d491-41c6-a97f-589968b45ba4_915x281.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>In short, if you can clearly write down the logic in a flowchart, simple rules are often your best bet. They are cheap to build, fast to run, and easy to debug.</p><h2>When to Bring in Machine Learning</h2><p>Machine Learning becomes necessary when the limitations of a rule-based system are exceeded. 
ML models learn patterns from data rather than being explicitly programmed. This makes them ideal for problems that are too complex, nuanced, or dynamic for manual rules.</p><p>You need ML when your problem involves:</p><ul><li><p><strong>Unmanageable Complexity:</strong> The number of potential <code>if-then</code> rules would be in the millions or billions. Consider identifying a cat in a photo. You can't write rules for every possible combination of pixels, lighting, and angles. An ML model, however, can learn the general "pattern" of a cat from thousands of examples.</p></li><li><p><strong>Handling Nuance and Ambiguity:</strong> Simple rules fail when context is key. Sentiment analysis is a classic example. The statement "This horror movie was sick!" is positive, but a rule-based system might flag the word "sick" as negative. An ML model trained on modern language can understand this nuance.</p></li><li><p><strong>Adaptability and Scalability:</strong> The world changes, and your system needs to adapt. A spam filter with hardcoded rules will quickly become obsolete as spammers change their tactics. An ML model can be continuously retrained on new examples of spam to stay effective. This ability to learn from new data is ML's greatest strength.</p></li><li><p><strong>Personalization:</strong> The "rules" need to be different for every user. A recommendation engine like Netflix's can't use a global set of rules. It builds a personalized model for each user based on their unique viewing history, a task impossible to manage with <code>if-then</code> logic.</p></li><li><p><strong>Finding Hidden Patterns:</strong> Often, the most valuable insights are hidden in noisy data. ML is essential for tasks like fraud detection, where criminals are actively trying to disguise their behavior. 
A model can identify subtle, multi-variate patterns across thousands of transactions that no human could ever spot or code as a rule.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Dw_M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f6dccb-82de-4052-944e-aae2db9cd8a8_922x280.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Dw_M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f6dccb-82de-4052-944e-aae2db9cd8a8_922x280.png 424w, https://substackcdn.com/image/fetch/$s_!Dw_M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f6dccb-82de-4052-944e-aae2db9cd8a8_922x280.png 848w, https://substackcdn.com/image/fetch/$s_!Dw_M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f6dccb-82de-4052-944e-aae2db9cd8a8_922x280.png 1272w, https://substackcdn.com/image/fetch/$s_!Dw_M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f6dccb-82de-4052-944e-aae2db9cd8a8_922x280.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Dw_M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f6dccb-82de-4052-944e-aae2db9cd8a8_922x280.png" width="922" height="280" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c7f6dccb-82de-4052-944e-aae2db9cd8a8_922x280.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:280,&quot;width&quot;:922,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:71729,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/174081629?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f6dccb-82de-4052-944e-aae2db9cd8a8_922x280.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Dw_M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f6dccb-82de-4052-944e-aae2db9cd8a8_922x280.png 424w, https://substackcdn.com/image/fetch/$s_!Dw_M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f6dccb-82de-4052-944e-aae2db9cd8a8_922x280.png 848w, https://substackcdn.com/image/fetch/$s_!Dw_M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f6dccb-82de-4052-944e-aae2db9cd8a8_922x280.png 1272w, https://substackcdn.com/image/fetch/$s_!Dw_M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f6dccb-82de-4052-944e-aae2db9cd8a8_922x280.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><h2>The New Frontier: LLMs and Agents</h2><p><strong>Large Language Models (LLMs)</strong> and agents built on top of them represent a major evolution in this paradigm. They don't just solve complex problems; they handle tasks that are fundamentally about reasoning, understanding, and generating human-like responses.</p><h3>LLMs as Ultimate Pattern Recognizers</h3><p>LLMs are a form of machine learning specifically designed for the most complex and nuanced data of all: <strong>unstructured language</strong>. You would use an LLM instead of rules or even traditional ML for tasks that rely on a deep understanding of context, semantics, and intent.</p><ul><li><p><strong>Summarizing a document:</strong> There are no simple rules for creating a good summary. 
It requires understanding the key topics, their relationships, and rephrasing them concisely.</p></li><li><p><strong>Powering a chatbot:</strong> A rule-based chatbot is brittle and frustrating ("I'm sorry, I don't understand that."). An LLM-powered chatbot can handle a vast range of conversational topics and user intents.</p></li><li><p><strong>Semantic search:</strong> Instead of just matching keywords (a rule), an LLM can understand the <em>meaning</em> behind a query. Searching for "movies about fighting a corrupt government" can return <em>V for Vendetta</em>, even if those exact words don't appear in the description.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oFcg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5bfd6f-db89-484b-adfb-b1f506d63862_916x289.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oFcg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5bfd6f-db89-484b-adfb-b1f506d63862_916x289.png 424w, https://substackcdn.com/image/fetch/$s_!oFcg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5bfd6f-db89-484b-adfb-b1f506d63862_916x289.png 848w, https://substackcdn.com/image/fetch/$s_!oFcg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5bfd6f-db89-484b-adfb-b1f506d63862_916x289.png 1272w, https://substackcdn.com/image/fetch/$s_!oFcg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5bfd6f-db89-484b-adfb-b1f506d63862_916x289.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!oFcg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5bfd6f-db89-484b-adfb-b1f506d63862_916x289.png" width="916" height="289" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ff5bfd6f-db89-484b-adfb-b1f506d63862_916x289.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:289,&quot;width&quot;:916,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:70450,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/174081629?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5bfd6f-db89-484b-adfb-b1f506d63862_916x289.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oFcg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5bfd6f-db89-484b-adfb-b1f506d63862_916x289.png 424w, https://substackcdn.com/image/fetch/$s_!oFcg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5bfd6f-db89-484b-adfb-b1f506d63862_916x289.png 848w, https://substackcdn.com/image/fetch/$s_!oFcg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5bfd6f-db89-484b-adfb-b1f506d63862_916x289.png 1272w, https://substackcdn.com/image/fetch/$s_!oFcg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff5bfd6f-db89-484b-adfb-b1f506d63862_916x289.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><h3>Agents: Moving from Prediction to Action</h3><p><strong>Agents</strong> are systems that use LLMs as their "brain" to make decisions and take actions. They are the ultimate escape from a fixed set of rules. 
An agent can be given a high-level goal, and it will reason about the steps needed to achieve it.</p><p>For example, a rule-based system for booking travel might look like this:</p><p><code>IF destination is Paris AND hotel_rating &gt; 4 AND price &lt; 200 THEN suggest_booking</code>.</p><p>An <strong>agent</strong>, on the other hand, can handle a vague prompt like: "<strong>Plan a 3-day weekend trip to Paris for me next month. I like art museums and want to stay somewhere with character under &#8364;200 a night.</strong>"</p><p>The agent would use its LLM core to:</p><ol><li><p><strong>Deconstruct the Goal:</strong> Identify key constraints (Paris, 3 days, next month, art museums, character, &lt;&#8364;200 a night).</p></li><li><p><strong>Formulate a Plan:</strong> Search for flights, browse hotels that match the description "character," find top art museums, check their opening hours, etc.</p></li><li><p><strong>Execute Tools:</strong> Interact with APIs for airlines, hotels, and maps.</p></li><li><p><strong>Synthesize and Respond:</strong> Present a coherent itinerary to the user.</p></li></ol><p>In this scenario, writing rules would be impractical: no fixed rule set could anticipate every phrasing and preference a user might express. The agent's behavior is dynamic, goal-oriented, and emergent, making it a strong fit for complex, multi-step problems that defy rigid logic.</p><h2>A simple decision framework</h2><p>Before you choose, answer three questions in order.</p><ol><li><p><strong>Can I meet the requirement with fixed logic that a non-expert can read in one sitting?</strong><br>If yes, write rules. If the logic is longer than a few pages and you struggle to name each block neatly, you are probably forcing rules to do a job they are not suited for.</p></li><li><p><strong>Do I have stable labels or a clear signal to learn from?<br></strong>If yes, classic ML is a strong candidate. If labels are scarce but text and context are rich, an LLM with retrieval can be viable. 
If you have neither labels nor context, collect them first before choosing a model.</p></li><li><p><strong>What is the operational tolerance for mistakes, latency, and drift?<br></strong>High-risk and low-latency contexts favour simple rules or transparent ML. Open-ended, low-risk experiences can accept LLMs and agents.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eNP9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad4215d7-380e-4deb-b07f-094da9435acf_750x323.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eNP9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad4215d7-380e-4deb-b07f-094da9435acf_750x323.png 424w, https://substackcdn.com/image/fetch/$s_!eNP9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad4215d7-380e-4deb-b07f-094da9435acf_750x323.png 848w, https://substackcdn.com/image/fetch/$s_!eNP9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad4215d7-380e-4deb-b07f-094da9435acf_750x323.png 1272w, https://substackcdn.com/image/fetch/$s_!eNP9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad4215d7-380e-4deb-b07f-094da9435acf_750x323.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eNP9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad4215d7-380e-4deb-b07f-094da9435acf_750x323.png" width="750" height="323" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ad4215d7-380e-4deb-b07f-094da9435acf_750x323.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:323,&quot;width&quot;:750,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:40778,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/174081629?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad4215d7-380e-4deb-b07f-094da9435acf_750x323.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eNP9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad4215d7-380e-4deb-b07f-094da9435acf_750x323.png 424w, https://substackcdn.com/image/fetch/$s_!eNP9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad4215d7-380e-4deb-b07f-094da9435acf_750x323.png 848w, https://substackcdn.com/image/fetch/$s_!eNP9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad4215d7-380e-4deb-b07f-094da9435acf_750x323.png 1272w, https://substackcdn.com/image/fetch/$s_!eNP9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad4215d7-380e-4deb-b07f-094da9435acf_750x323.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vUiq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d9f49c-c277-4274-a2e5-297f6a626889_752x311.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vUiq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d9f49c-c277-4274-a2e5-297f6a626889_752x311.png 424w, 
https://substackcdn.com/image/fetch/$s_!vUiq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d9f49c-c277-4274-a2e5-297f6a626889_752x311.png 848w, https://substackcdn.com/image/fetch/$s_!vUiq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d9f49c-c277-4274-a2e5-297f6a626889_752x311.png 1272w, https://substackcdn.com/image/fetch/$s_!vUiq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d9f49c-c277-4274-a2e5-297f6a626889_752x311.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vUiq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d9f49c-c277-4274-a2e5-297f6a626889_752x311.png" width="752" height="311" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8d9f49c-c277-4274-a2e5-297f6a626889_752x311.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:311,&quot;width&quot;:752,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43062,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/174081629?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d9f49c-c277-4274-a2e5-297f6a626889_752x311.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vUiq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d9f49c-c277-4274-a2e5-297f6a626889_752x311.png 424w, 
https://substackcdn.com/image/fetch/$s_!vUiq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d9f49c-c277-4274-a2e5-297f6a626889_752x311.png 848w, https://substackcdn.com/image/fetch/$s_!vUiq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d9f49c-c277-4274-a2e5-297f6a626889_752x311.png 1272w, https://substackcdn.com/image/fetch/$s_!vUiq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8d9f49c-c277-4274-a2e5-297f6a626889_752x311.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3mp9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6bdec3a-9193-40f2-99f6-fded2ad445ff_750x293.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3mp9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6bdec3a-9193-40f2-99f6-fded2ad445ff_750x293.png 424w, https://substackcdn.com/image/fetch/$s_!3mp9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6bdec3a-9193-40f2-99f6-fded2ad445ff_750x293.png 848w, https://substackcdn.com/image/fetch/$s_!3mp9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6bdec3a-9193-40f2-99f6-fded2ad445ff_750x293.png 1272w, https://substackcdn.com/image/fetch/$s_!3mp9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6bdec3a-9193-40f2-99f6-fded2ad445ff_750x293.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3mp9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6bdec3a-9193-40f2-99f6-fded2ad445ff_750x293.png" width="750" height="293" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d6bdec3a-9193-40f2-99f6-fded2ad445ff_750x293.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:293,&quot;width&quot;:750,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42473,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/174081629?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6bdec3a-9193-40f2-99f6-fded2ad445ff_750x293.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3mp9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6bdec3a-9193-40f2-99f6-fded2ad445ff_750x293.png 424w, https://substackcdn.com/image/fetch/$s_!3mp9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6bdec3a-9193-40f2-99f6-fded2ad445ff_750x293.png 848w, https://substackcdn.com/image/fetch/$s_!3mp9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6bdec3a-9193-40f2-99f6-fded2ad445ff_750x293.png 1272w, https://substackcdn.com/image/fetch/$s_!3mp9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6bdec3a-9193-40f2-99f6-fded2ad445ff_750x293.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><blockquote><p>Remember, your <strong>likes</strong> &#10084;&#65039; and <strong>shares</strong> &#128260; help me keep this going and bring you the best content week after week. 
</p></blockquote><div><hr></div><h1><strong>&#128214; Book of the Week</strong></h1><p><em><strong>&#8220;<a href="https://www.packtpub.com/en-es/product/ai-agents-in-practice-9781805801351">AI Agents in Practice</a>&#8221;</strong></em> by Valentina Alto</p><p>Go beyond simple chatbots&#8212;this book is your guide to building <strong>production-ready AI agents</strong> that plan, reason, collaborate, and deliver real value.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xlgx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63fae90a-b72c-4839-ba51-dd5f4d847926_2250x2775" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xlgx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63fae90a-b72c-4839-ba51-dd5f4d847926_2250x2775 424w, https://substackcdn.com/image/fetch/$s_!Xlgx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63fae90a-b72c-4839-ba51-dd5f4d847926_2250x2775 848w, https://substackcdn.com/image/fetch/$s_!Xlgx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63fae90a-b72c-4839-ba51-dd5f4d847926_2250x2775 1272w, https://substackcdn.com/image/fetch/$s_!Xlgx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63fae90a-b72c-4839-ba51-dd5f4d847926_2250x2775 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xlgx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63fae90a-b72c-4839-ba51-dd5f4d847926_2250x2775" 
width="446" height="550.1483516483516" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/63fae90a-b72c-4839-ba51-dd5f4d847926_2250x2775&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1796,&quot;width&quot;:1456,&quot;resizeWidth&quot;:446,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;AI Agents in Practice&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="AI Agents in Practice" title="AI Agents in Practice" srcset="https://substackcdn.com/image/fetch/$s_!Xlgx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63fae90a-b72c-4839-ba51-dd5f4d847926_2250x2775 424w, https://substackcdn.com/image/fetch/$s_!Xlgx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63fae90a-b72c-4839-ba51-dd5f4d847926_2250x2775 848w, https://substackcdn.com/image/fetch/$s_!Xlgx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63fae90a-b72c-4839-ba51-dd5f4d847926_2250x2775 1272w, https://substackcdn.com/image/fetch/$s_!Xlgx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63fae90a-b72c-4839-ba51-dd5f4d847926_2250x2775 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><h4>What you&#8217;ll learn</h4><ul><li><p>Design and deploy single- and multi-agent systems with frameworks like <strong>LangChain</strong> and <strong>LangGraph</strong></p></li><li><p>Build agents with memory, context, and tool integrations</p></li><li><p>Apply orchestration patterns and industry-specific case studies</p></li><li><p>Implement <strong>ethical safeguards</strong> and optimize for performance in production</p></li></ul><h4>Who it&#8217;s for</h4><ul><li><p><strong>AI engineers &amp; data scientists</strong> ready to move past prototypes</p></li><li><p><strong>Developers &amp; architects</strong> integrating agents into real systems</p></li><li><p><strong>Product leaders &amp; entrepreneurs</strong> exploring business applications of AI</p></li></ul><h4>Why read it</h4><p>You&#8217;ll get <strong>hands-on tutorials, framework comparisons, and practical design patterns</strong>&#8212;everything you need to 
future-proof your AI development.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.packtpub.com/en-es/product/ai-agents-in-practice-9781805801351&quot;,&quot;text&quot;:&quot;Get it here&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.packtpub.com/en-es/product/ai-agents-in-practice-9781805801351"><span>Get it here</span></a></p><div><hr></div><p></p><h1><strong>&#9889;&#65039;Power-Up Corner</strong></h1><p>The core decision framework tells you <em>what</em> to choose, but the real edge comes from knowing the traps to avoid, the hidden costs to plan for, and the hybrid patterns that actually work in practice. This section is a set of accelerators: lessons learned the hard way, checklists for readiness, and techniques that keep systems robust as they move from whiteboard to production. Treat it as a playbook of &#8220;power-ups&#8221; you can reach for when the basics aren&#8217;t enough.</p><p>Here&#8217;s what we&#8217;ll cover:</p><ul><li><p><strong>Hidden costs and common failure modes</strong> &#8211; how rules, ML, and LLMs break differently in the real world.</p></li><li><p><strong>Hybrid patterns that work</strong> &#8211; pragmatic combinations that balance explainability, adaptability, and control.</p></li><li><p><strong>Data readiness checklist</strong> &#8211; how to know when you&#8217;re truly prepared for ML or LLM adoption.</p></li><li><p><strong>Evaluation that maps to reality</strong> &#8211; testing approaches that reflect production conditions, not lab toys.</p></li><li><p><strong>Migration path</strong> &#8211; a sensible, staged evolution from rules to ML to agents.</p></li><li><p><strong>Case study</strong> &#8211; anomaly detection in finance ops as a worked example.</p></li></ul><h2>Hidden costs and common failure modes</h2>
      <p>
          <a href="https://mlpills.substack.com/p/rw-7-when-to-use-rules-ml-or-llms">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[RW #6 - Text-Moderation System with Embeddings]]></title><description><![CDATA[This week we&#8217;re handing you a plug-and-play, notebook-ready tutorial you can drop straight into Jupyter or VS Code. Inside you&#8217;ll find: Why an embedding-plus-classifier pipeline trumps funneling every chat message through a giant LLM&#8212;think millisecond latency, predictable cost, and rock-solid determinism. A cell-by-cell build of that pipeline, with plain-English commentary before and after each code block so you always know what you&#8217;re running and why it matters. By the end you&#8217;ll walk away with production-grade moderation code you can ship as-is&#8212;or adapt to your own data in minutes.]]></description><link>https://mlpills.substack.com/p/rw-6-text-moderation-system-with</link><guid isPermaLink="false">https://mlpills.substack.com/p/rw-6-text-moderation-system-with</guid><dc:creator><![CDATA[David Andrés]]></dc:creator><pubDate>Sun, 03 Aug 2025 07:30:42 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3e8875ca-f968-4536-bbcd-98588a41ab33_1488x761.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1><strong>&#128138; Pill of the Week</strong></h1><p>This week we&#8217;re handing you a plug-and-play, notebook-ready tutorial you can drop straight into Jupyter or VS Code. 
Inside you&#8217;ll find:</p><ul><li><p><strong>Why an embedding-plus-classifier pipeline trumps funneling every chat message through a giant LLM</strong>&#8212;think millisecond latency, predictable cost, and rock-solid determinism.</p></li><li><p><strong>A cell-by-cell build</strong> of that pipeline, with plain-English commentary before and after each code block so you always know <em>what</em> you&#8217;re running and <em>why</em> it matters.</p></li></ul><p>By the end you&#8217;ll walk away with production-grade moderation code you can ship as-is&#8212;or adapt to your own data in minutes.</p><h3>Why choose an embeddings-based classifier?</h3><p>When a sentence is converted into a dense vector, texts that share meaning land near one another even if the wording differs. Training a lightweight classifier on those vectors brings five practical gains:</p><ol><li><p><strong>Low latency</strong> &#8211; inference often completes in under ten milliseconds on a CPU.</p></li><li><p><strong>Predictable cost</strong> &#8211; you pay only for a fixed-size vector and a tiny model, not per token.</p></li><li><p><strong>Wide language coverage</strong> &#8211; modern encoders such as MiniLM or LaBSE generalise well across dozens of languages.</p></li><li><p><strong>Deterministic output</strong> &#8211; the same input always yields the same label, simplifying appeals.</p></li><li><p><strong>Easy retraining</strong> &#8211; a fresh batch of labelled messages and one fit command are all that is needed when policy changes.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rfd_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96d3569-3e05-4646-8442-f2ed7b917574_1149x288.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!rfd_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96d3569-3e05-4646-8442-f2ed7b917574_1149x288.png 424w, https://substackcdn.com/image/fetch/$s_!rfd_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96d3569-3e05-4646-8442-f2ed7b917574_1149x288.png 848w, https://substackcdn.com/image/fetch/$s_!rfd_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96d3569-3e05-4646-8442-f2ed7b917574_1149x288.png 1272w, https://substackcdn.com/image/fetch/$s_!rfd_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96d3569-3e05-4646-8442-f2ed7b917574_1149x288.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rfd_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96d3569-3e05-4646-8442-f2ed7b917574_1149x288.png" width="1149" height="288" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f96d3569-3e05-4646-8442-f2ed7b917574_1149x288.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:288,&quot;width&quot;:1149,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43110,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/169914272?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96d3569-3e05-4646-8442-f2ed7b917574_1149x288.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rfd_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96d3569-3e05-4646-8442-f2ed7b917574_1149x288.png 424w, https://substackcdn.com/image/fetch/$s_!rfd_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96d3569-3e05-4646-8442-f2ed7b917574_1149x288.png 848w, https://substackcdn.com/image/fetch/$s_!rfd_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96d3569-3e05-4646-8442-f2ed7b917574_1149x288.png 1272w, https://substackcdn.com/image/fetch/$s_!rfd_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96d3569-3e05-4646-8442-f2ed7b917574_1149x288.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><h3>Why sending every message to an LLM is often overkill</h3><p>LLMs excel at reasoning over long passages, yet they come with seven drawbacks that matter in real deployments:</p><ul><li><p><strong>Cost and throughput</strong>: token-priced calls and GPU reliance make per-message moderation expensive at scale.</p></li><li><p><strong>Latency</strong>: even small hosted models usually take hundreds of milliseconds, which users notice in live chat.</p></li><li><p><strong>Stochasticity</strong>: identical prompts can give different judgements later, complicating audits.</p></li><li><p><strong>Opaque decisions</strong>: explaining a multi-billion-parameter model to regulators is far harder than pointing to a logistic coefficient and a nearest-neighbour example.</p></li><li><p><strong>Prompt attacks</strong>: adversaries can hide violations behind role-play or system-message tricks; a pure classifier ignores prompts altogether.</p></li><li><p><strong>Privacy concerns</strong>: shipping every raw message to a third-party endpoint may breach GDPR or internal policy, whereas local embeddings avoid this.</p></li><li><p><strong>Rate limits and outages</strong>: hosted LLMs throttle traffic; a self-hosted embedding pipeline scales horizontally on standard hardware.</p></li></ul><p>LLMs are still useful for low-volume, long-form reviews or as a second pass on borderline cases, but for a busy chat stream an embedding pipeline wins on speed, cost and predictability.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!58Ms!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad958f3-0c18-47a0-bce5-a58507c4cc1f_935x638.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!58Ms!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad958f3-0c18-47a0-bce5-a58507c4cc1f_935x638.png 424w, https://substackcdn.com/image/fetch/$s_!58Ms!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad958f3-0c18-47a0-bce5-a58507c4cc1f_935x638.png 848w, https://substackcdn.com/image/fetch/$s_!58Ms!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad958f3-0c18-47a0-bce5-a58507c4cc1f_935x638.png 1272w, https://substackcdn.com/image/fetch/$s_!58Ms!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad958f3-0c18-47a0-bce5-a58507c4cc1f_935x638.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!58Ms!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad958f3-0c18-47a0-bce5-a58507c4cc1f_935x638.png" width="688" height="469.45882352941175" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ad958f3-0c18-47a0-bce5-a58507c4cc1f_935x638.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:638,&quot;width&quot;:935,&quot;resizeWidth&quot;:688,&quot;bytes&quot;:104328,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/169914272?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad958f3-0c18-47a0-bce5-a58507c4cc1f_935x638.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!58Ms!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad958f3-0c18-47a0-bce5-a58507c4cc1f_935x638.png 424w, https://substackcdn.com/image/fetch/$s_!58Ms!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad958f3-0c18-47a0-bce5-a58507c4cc1f_935x638.png 848w, https://substackcdn.com/image/fetch/$s_!58Ms!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad958f3-0c18-47a0-bce5-a58507c4cc1f_935x638.png 1272w, https://substackcdn.com/image/fetch/$s_!58Ms!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad958f3-0c18-47a0-bce5-a58507c4cc1f_935x638.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><h3>Hands-on example</h3><p>Here we aim to demonstrate how to build a highly efficient and reliable text moderation system for high-volume applications, such as live chat. 
</p><p>We achieve this by using a lightweight, two-part pipeline: </p><ol><li><p>converting text messages into numerical <strong>sentence embeddings</strong></p></li><li><p>training a small, fast classifier on those embeddings</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L2Yv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7b3d13b-8ad6-4a56-a9e6-9414fd6fc4aa_1304x190.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L2Yv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7b3d13b-8ad6-4a56-a9e6-9414fd6fc4aa_1304x190.png 424w, https://substackcdn.com/image/fetch/$s_!L2Yv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7b3d13b-8ad6-4a56-a9e6-9414fd6fc4aa_1304x190.png 848w, https://substackcdn.com/image/fetch/$s_!L2Yv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7b3d13b-8ad6-4a56-a9e6-9414fd6fc4aa_1304x190.png 1272w, https://substackcdn.com/image/fetch/$s_!L2Yv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7b3d13b-8ad6-4a56-a9e6-9414fd6fc4aa_1304x190.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!L2Yv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7b3d13b-8ad6-4a56-a9e6-9414fd6fc4aa_1304x190.png" width="1304" height="190" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b7b3d13b-8ad6-4a56-a9e6-9414fd6fc4aa_1304x190.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:190,&quot;width&quot;:1304,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:37030,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/169914272?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7b3d13b-8ad6-4a56-a9e6-9414fd6fc4aa_1304x190.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!L2Yv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7b3d13b-8ad6-4a56-a9e6-9414fd6fc4aa_1304x190.png 424w, https://substackcdn.com/image/fetch/$s_!L2Yv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7b3d13b-8ad6-4a56-a9e6-9414fd6fc4aa_1304x190.png 848w, https://substackcdn.com/image/fetch/$s_!L2Yv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7b3d13b-8ad6-4a56-a9e6-9414fd6fc4aa_1304x190.png 1272w, https://substackcdn.com/image/fetch/$s_!L2Yv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7b3d13b-8ad6-4a56-a9e6-9414fd6fc4aa_1304x190.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This method is designed to be superior to using a large language model (LLM) for every message, as it provides <strong>extremely low latency</strong>, <strong>predictable and low 
costs</strong>, and <strong>deterministic outputs</strong> that are easy to audit and scale. </p><p>The following simple flowcharts illustrate how the pipeline processes and flags a message. A message is converted into a vector, scored by the classifier, and then a final decision is made based on the score.</p><ul><li><p>Valid input message:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KtA6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78dde63-e986-4e94-a1c2-ebce35ee360a_1085x197.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KtA6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78dde63-e986-4e94-a1c2-ebce35ee360a_1085x197.png 424w, https://substackcdn.com/image/fetch/$s_!KtA6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78dde63-e986-4e94-a1c2-ebce35ee360a_1085x197.png 848w, https://substackcdn.com/image/fetch/$s_!KtA6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78dde63-e986-4e94-a1c2-ebce35ee360a_1085x197.png 1272w, https://substackcdn.com/image/fetch/$s_!KtA6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78dde63-e986-4e94-a1c2-ebce35ee360a_1085x197.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KtA6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78dde63-e986-4e94-a1c2-ebce35ee360a_1085x197.png" width="1085" height="197" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b78dde63-e986-4e94-a1c2-ebce35ee360a_1085x197.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:197,&quot;width&quot;:1085,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:29439,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/169914272?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78dde63-e986-4e94-a1c2-ebce35ee360a_1085x197.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KtA6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78dde63-e986-4e94-a1c2-ebce35ee360a_1085x197.png 424w, https://substackcdn.com/image/fetch/$s_!KtA6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78dde63-e986-4e94-a1c2-ebce35ee360a_1085x197.png 848w, https://substackcdn.com/image/fetch/$s_!KtA6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78dde63-e986-4e94-a1c2-ebce35ee360a_1085x197.png 1272w, https://substackcdn.com/image/fetch/$s_!KtA6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb78dde63-e986-4e94-a1c2-ebce35ee360a_1085x197.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><ul><li><p>Input message that must be moderated:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!Y2Xo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc1d4ac-0048-4e98-99cd-cfdd3a1e4ba5_1076x177.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y2Xo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc1d4ac-0048-4e98-99cd-cfdd3a1e4ba5_1076x177.png 424w, https://substackcdn.com/image/fetch/$s_!Y2Xo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc1d4ac-0048-4e98-99cd-cfdd3a1e4ba5_1076x177.png 848w, https://substackcdn.com/image/fetch/$s_!Y2Xo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc1d4ac-0048-4e98-99cd-cfdd3a1e4ba5_1076x177.png 1272w, https://substackcdn.com/image/fetch/$s_!Y2Xo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc1d4ac-0048-4e98-99cd-cfdd3a1e4ba5_1076x177.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Y2Xo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc1d4ac-0048-4e98-99cd-cfdd3a1e4ba5_1076x177.png" width="1076" height="177" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7bc1d4ac-0048-4e98-99cd-cfdd3a1e4ba5_1076x177.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:177,&quot;width&quot;:1076,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:27862,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/169914272?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc1d4ac-0048-4e98-99cd-cfdd3a1e4ba5_1076x177.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Y2Xo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc1d4ac-0048-4e98-99cd-cfdd3a1e4ba5_1076x177.png 424w, https://substackcdn.com/image/fetch/$s_!Y2Xo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc1d4ac-0048-4e98-99cd-cfdd3a1e4ba5_1076x177.png 848w, https://substackcdn.com/image/fetch/$s_!Y2Xo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc1d4ac-0048-4e98-99cd-cfdd3a1e4ba5_1076x177.png 1272w, https://substackcdn.com/image/fetch/$s_!Y2Xo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc1d4ac-0048-4e98-99cd-cfdd3a1e4ba5_1076x177.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Let&#8217;s begin with the <strong>real-world</strong> coding example:</p><h4>Environment set-up</h4><p>Install the required libraries. 
Skip this cell if your environment already has them.</p><pre><code>!pip install -q sentence-transformers scikit-learn pandas matplotlib</code></pre><h4>Import core modules</h4><p>We import everything the notebook will need.</p><pre><code>from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, RocCurveDisplay
import pandas as pd
import matplotlib.pyplot as plt</code></pre><h4>Load and inspect data</h4><p>Our dataset, <code>moderation_dataset.csv</code>, contains two columns:</p><ul><li><p><code>text</code>: the message content</p></li><li><p><code>label</code>: a binary flag where <code>1</code> indicates the message requires moderation, and <code>0</code> means it is acceptable.</p></li></ul><pre><code>df = pd.read_csv("moderation_dataset.csv")
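# Optional sanity check, not part of the original notebook: a hypothetical
# helper (check_columns is an illustrative name) that fails fast if the
# CSV lacks the two columns described above.
def check_columns(frame, required=("text", "label")):
    missing = [col for col in required if col not in frame.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return frame

# Usage in this cell: check_columns(df)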
df.head(15)</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ox-0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c009fe-5365-4e3a-a0fe-3016f759d60a_505x627.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ox-0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c009fe-5365-4e3a-a0fe-3016f759d60a_505x627.png 424w, https://substackcdn.com/image/fetch/$s_!ox-0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c009fe-5365-4e3a-a0fe-3016f759d60a_505x627.png 848w, https://substackcdn.com/image/fetch/$s_!ox-0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c009fe-5365-4e3a-a0fe-3016f759d60a_505x627.png 1272w, https://substackcdn.com/image/fetch/$s_!ox-0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c009fe-5365-4e3a-a0fe-3016f759d60a_505x627.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ox-0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c009fe-5365-4e3a-a0fe-3016f759d60a_505x627.png" width="457" height="567.4039603960396" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/38c009fe-5365-4e3a-a0fe-3016f759d60a_505x627.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:627,&quot;width&quot;:505,&quot;resizeWidth&quot;:457,&quot;bytes&quot;:80949,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/169914272?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c009fe-5365-4e3a-a0fe-3016f759d60a_505x627.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ox-0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c009fe-5365-4e3a-a0fe-3016f759d60a_505x627.png 424w, https://substackcdn.com/image/fetch/$s_!ox-0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c009fe-5365-4e3a-a0fe-3016f759d60a_505x627.png 848w, https://substackcdn.com/image/fetch/$s_!ox-0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c009fe-5365-4e3a-a0fe-3016f759d60a_505x627.png 1272w, https://substackcdn.com/image/fetch/$s_!ox-0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38c009fe-5365-4e3a-a0fe-3016f759d60a_505x627.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>For this example we&#8217;ve used data from this Kaggle competition: </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/data&quot;,&quot;text&quot;:&quot;Dataset&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/data"><span>Dataset</span></a></p><p>Let&#8217;s check the label distribution:</p><pre><code>df.label.value_counts()</code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!PXJm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03e82dd-c157-4890-bfdd-db1ef386baf1_159x208.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PXJm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03e82dd-c157-4890-bfdd-db1ef386baf1_159x208.png 424w, https://substackcdn.com/image/fetch/$s_!PXJm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03e82dd-c157-4890-bfdd-db1ef386baf1_159x208.png 848w, https://substackcdn.com/image/fetch/$s_!PXJm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03e82dd-c157-4890-bfdd-db1ef386baf1_159x208.png 1272w, https://substackcdn.com/image/fetch/$s_!PXJm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03e82dd-c157-4890-bfdd-db1ef386baf1_159x208.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PXJm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03e82dd-c157-4890-bfdd-db1ef386baf1_159x208.png" width="135" height="176.60377358490567" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a03e82dd-c157-4890-bfdd-db1ef386baf1_159x208.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:208,&quot;width&quot;:159,&quot;resizeWidth&quot;:135,&quot;bytes&quot;:6622,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/169914272?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03e82dd-c157-4890-bfdd-db1ef386baf1_159x208.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PXJm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03e82dd-c157-4890-bfdd-db1ef386baf1_159x208.png 424w, https://substackcdn.com/image/fetch/$s_!PXJm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03e82dd-c157-4890-bfdd-db1ef386baf1_159x208.png 848w, https://substackcdn.com/image/fetch/$s_!PXJm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03e82dd-c157-4890-bfdd-db1ef386baf1_159x208.png 1272w, https://substackcdn.com/image/fetch/$s_!PXJm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa03e82dd-c157-4890-bfdd-db1ef386baf1_159x208.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Our dataset is highly imbalanced, though it includes a large number of minority class examples. In this example, we'll apply balanced sampling to train on an even class distribution. 
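We also keep only 1,000 sentences (500 per class) to keep the example fast; a larger sample would likely improve accuracy.</p><pre><code># Draw 500 examples per class; the fixed seed makes the sample reproducible
df = pd.concat((
    df[df.label == 1].sample(500, random_state=42),
    df[df.label == 0].sample(500, random_state=42)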
), axis=0)</code></pre><p>Alternatively, we could keep the natural imbalance and compensate using class weights&#8212;particularly useful when resampling isn't desirable or leads to overfitting.</p><h4>Create sentence embeddings</h4><p>We will use OpenAI&#8217;s <code>text-embedding-3-small</code> model with 256 dimensions. This offers high-quality multilingual embeddings in a compact size suitable for low-latency tasks.</p><pre><code>from openai import OpenAI
import os

from tqdm import tqdm

# Read the API key from the environment instead of hard-coding it
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

embeddings = []
for text in tqdm(df["text"].tolist(), desc="Embedding texts"):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small",
        dimensions=256
    )
    embeddings.append(response.data[0].embedding)</code></pre><p>Make sure the <code>OPENAI_API_KEY</code> environment variable is set beforehand. The 256-dimensional variant is a reasonable trade-off between embedding quality and storage and compute cost for classification tasks.</p><h4>Prepare train and test sets</h4><p>A stratified split keeps the class ratio the same in the training and test sets.</p><pre><code>from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    embeddings,
    df["label"],
    test_size=0.20,
    stratify=df["label"],
    random_state=42
)</code></pre><p>Using a fixed random seed makes results reproducible.</p><div><hr></div><h1><strong>&#128214; Book of the Week</strong></h1><p>Ready to pair large-language-model magic with the power of knowledge graphs? </p><p>Today I share with you &#8220;<strong>Building Neo4j-Powered Applications with LLMs</strong>&#8221;</p><p><em><strong>Ravindranatha Anthapu</strong></em> and <em><strong>Siddhant Agarwal</strong></em>&#8217;s brand-new guide shows&#8212;step by step&#8212;how to stand up a full Retrieval-Augmented Generation (RAG) stack on Neo4j, then ship it to production on Google Cloud. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://www.packtpub.com/en-us/product/building-neo4j-powered-applications-with-llms-9781836206224" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_oRw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6857580e-e19c-4246-82a1-8847cf06c93f_2250x2775 424w, https://substackcdn.com/image/fetch/$s_!_oRw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6857580e-e19c-4246-82a1-8847cf06c93f_2250x2775 848w, https://substackcdn.com/image/fetch/$s_!_oRw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6857580e-e19c-4246-82a1-8847cf06c93f_2250x2775 1272w, https://substackcdn.com/image/fetch/$s_!_oRw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6857580e-e19c-4246-82a1-8847cf06c93f_2250x2775 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!_oRw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6857580e-e19c-4246-82a1-8847cf06c93f_2250x2775" width="372" height="458.86813186813185" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6857580e-e19c-4246-82a1-8847cf06c93f_2250x2775&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1796,&quot;width&quot;:1456,&quot;resizeWidth&quot;:372,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Building Neo4j-Powered Applications with LLMs&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://www.packtpub.com/en-us/product/building-neo4j-powered-applications-with-llms-9781836206224&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Building Neo4j-Powered Applications with LLMs" title="Building Neo4j-Powered Applications with LLMs" srcset="https://substackcdn.com/image/fetch/$s_!_oRw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6857580e-e19c-4246-82a1-8847cf06c93f_2250x2775 424w, https://substackcdn.com/image/fetch/$s_!_oRw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6857580e-e19c-4246-82a1-8847cf06c93f_2250x2775 848w, https://substackcdn.com/image/fetch/$s_!_oRw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6857580e-e19c-4246-82a1-8847cf06c93f_2250x2775 1272w, 
https://substackcdn.com/image/fetch/$s_!_oRw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6857580e-e19c-4246-82a1-8847cf06c93f_2250x2775 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Why it&#8217;s worth your coffee break?</strong></p><ul><li><p><strong>Hands-on RAG pipeline:</strong> ingest, summarize, embed, and retrieve customer behavior data with LangChain4j, then wire it all together in Spring AI.</p></li><li><p><strong>Graph + vector search that </strong><em><strong>just works</strong></em><strong>:</strong> blend Cypher queries 
with Haystack&#8217;s hybrid retrieval to serve smarter, context-rich answers.</p></li><li><p><strong>Less hallucination, more reasoning:</strong> grounding techniques and multi-hop patterns that keep LLMs honest.</p></li><li><p><strong>Deploy in one push:</strong> opinionated blueprint for CI/CD to GCP&#8212;including secrets, auth, and cost tips.</p></li></ul><p>You can get it here:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.packtpub.com/en-us/product/building-neo4j-powered-applications-with-llms-9781836206224&quot;,&quot;text&quot;:&quot;Get it here&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.packtpub.com/en-us/product/building-neo4j-powered-applications-with-llms-9781836206224"><span>Get it here</span></a></p><div><hr></div><h4>Train a baseline classifier</h4><p>Logistic regression handles dense embeddings well and trains in seconds on a dataset of this size.</p><pre><code>from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(
    max_iter=200,
    class_weight="balanced"  # helps if classes are uneven
)
clf.fit(X_train, y_train)</code></pre><p>Here we set <code>class_weight="balanced"</code> in addition to the balanced sampling. Because the sample is already split 500/500, the class weights come out close to 1 and change little in this example. The option matters when you train on the naturally imbalanced data instead: it reweights the loss function rather than throwing data away, which is often the more robust choice for production use cases.</p><h4>Evaluate initial performance</h4><p>We first look at metrics using the default 0.5 threshold.</p><pre><code>from sklearn.metrics import classification_report

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, digits=3))</code></pre><p>The result is:</p><pre><code>              precision    recall  f1-score   support

           0      0.885     0.850     0.867       100
           1      0.856     0.890     0.873       100

    accuracy                          0.870       200
   macro avg      0.871     0.870     0.870       200
weighted avg      0.871     0.870     0.870       200</code></pre><p>The model performs evenly across both classes, with an overall accuracy of 87%. Because the test set is split evenly between toxic and non-toxic examples (100 each), accuracy is a meaningful summary here, which it would not be on the raw, imbalanced data. For the non-toxic class the model reaches a precision of 88.5% and a recall of 85.0%; for the toxic class, a precision of 85.6% and a slightly higher recall of 89.0%. In absolute terms, that is 15 non-toxic comments wrongly flagged and 11 toxic comments missed.</p><p>The F1-scores are nearly identical (0.867 for non-toxic and 0.873 for toxic), and the macro and weighted averages coincide, as expected with equal class support. The model therefore separates harmful from harmless content without favoring either class, at least on this balanced sample. That matters for content moderation, where both false positives and false negatives carry real costs.</p><blockquote><p>If your policy prioritises <strong>identifying as many violations as possible</strong>&#8212;even at the <strong>risk of flagging some false positives</strong>&#8212;then you should focus on <strong>optimising recall</strong> for the positive class. This approach ensures that the model is highly sensitive to potential infractions, catching almost every instance that might constitute a violation, which is particularly critical in contexts like fraud detection, content moderation, or safety compliance. </p></blockquote>
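<p>As a minimal sketch of what optimising recall can look like in practice, the helper below scans candidate thresholds from strictest to most lenient and returns the first one whose recall on the positive class meets a target. The function name <code>pick_threshold</code>, the target value, and the toy scores are illustrative assumptions rather than part of the original pipeline; with the classifier above, the scores would come from <code>clf.predict_proba(X_test)[:, 1]</code>.</p>

```python
def pick_threshold(y_true, y_prob, target_recall=0.95):
    """Return (threshold, recall, precision) for the highest threshold
    whose recall on class 1 reaches target_recall."""
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("need at least one positive example")
    # Candidate thresholds are the observed scores, strictest first
    for threshold in sorted(set(y_prob), reverse=True):
        y_pred = [int(p >= threshold) for p in y_prob]
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        recall = tp / positives
        if recall >= target_recall:
            return threshold, recall, tp / (tp + fp)
    raise ValueError("target_recall must not exceed 1.0")


# Toy scores (hypothetical): 1 = toxic, 0 = non-toxic
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.45, 0.40, 0.20, 0.60, 0.80, 0.30, 0.90]

threshold, recall, precision = pick_threshold(y_true, y_prob, target_recall=1.0)
print(threshold, recall, precision)  # prints: 0.4 1.0 0.8
```

<p>On this toy data the default 0.5 threshold gives a recall of 0.75 with no false positives; dropping to 0.40 catches every toxic example but lets one non-toxic comment through, so precision falls to 0.8. In a real system the threshold would normally be chosen on a separate validation split so that the test metrics stay unbiased.</p>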
      <p>
          <a href="https://mlpills.substack.com/p/rw-6-text-moderation-system-with">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[RW #5 - No-Code Customer Service agent with LangFlow]]></title><description><![CDATA[Imagine a customer service operation where AI agents don't just rely on generic responses, but actually understand the business inside and out. Agents that can instantly access FAQs, product manuals, and policies&#8212;but only when they need to. This is the reality that Retrieval-Augmented Generation (RAG) combined with ReAct (Reasoning + Acting) agents brings to modern customer service. This issue explores how to use LangFlow to build sophisticated AI systems that transform customer inquiry handling. Whether managing product support, policy questions, or technical assistance, this approach revolutionizes customer service operations across industries.]]></description><link>https://mlpills.substack.com/p/rw-5-no-code-customer-service-agent</link><guid isPermaLink="false">https://mlpills.substack.com/p/rw-5-no-code-customer-service-agent</guid><dc:creator><![CDATA[David Andrés]]></dc:creator><pubDate>Sun, 13 Jul 2025 09:41:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e9b44c35-0cc5-43e2-835d-6e3c3a23cd0b_669x420.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1><strong>&#128138; Pill of the Week</strong></h1><p>Imagine a customer service operation where AI agents don't just rely on generic responses, but actually understand the business inside and out. Agents that can instantly access FAQs, product manuals, and policies&#8212;but only when they need to. This is the reality that Retrieval-Augmented Generation (RAG) combined with ReAct (Reasoning + Acting) agents brings to modern customer service.</p><p>This issue explores how to use LangFlow to build sophisticated AI systems that transform customer inquiry handling. 
Whether managing product support, policy questions, or technical assistance, this approach revolutionizes customer service operations across industries.</p><h2>The Foundation</h2><p>Before diving into implementation, let's establish the theoretical foundation that makes these intelligent agents possible.</p><h3>What is RAG (Retrieval-Augmented Generation)?</h3><p>RAG is a technique that helps language models become smarter by giving them access to external information&#8212;kind of like letting the model "look things up" before answering a question.</p><p>Normally, language models like GPT are limited to what they learned during training. They don't know anything that happened afterward, and they can sometimes "hallucinate" facts. RAG helps fix that. It works by combining two things: a search system that finds relevant information (retrieval), and a language model that uses that information to generate a response (generation).</p><p><strong>Here's how it works in practice</strong>: when a customer asks a question, the system first searches through a collection of documents or data to find the most relevant pieces. Then it passes those results to the language model, which uses them as context to answer the question. So the answer isn't just based on memory&#8212;it's grounded in real, retrieved content.</p><p>This setup allows RAG to be more accurate, especially for niche or up-to-date topics. Some systems even go further by showing where the information came from or using multiple search steps for complex queries. 
Of course, it's not perfect&#8212;too much information can overwhelm the model, and hallucinations can still happen&#8212;but overall, it's a big step toward more trustworthy AI systems.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;2c5b016e-5947-42e5-8928-aa11cea5bd97&quot;,&quot;caption&quot;:&quot;&#128138; Pill of the week&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Issue #56 - Retrieval-Augmented Generation&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:38707812,&quot;name&quot;:&quot;David Andr&#233;s&quot;,&quot;bio&quot;:&quot;&#128188; Data Scientist &#8226; &#128013; Python enthusiast&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db6423b2-36bc-440c-be7d-b54be5bad1b0_1447x1448.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null},{&quot;id&quot;:132707413,&quot;name&quot;:&quot;Josep Ferrer&quot;,&quot;bio&quot;:&quot;Outstand using data -- Data Science, Design and Tech Tech Writer @KDnuggets @DataCamp &#128073;&#127995;Inquiries in 
rfeers@gmail.com&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd196b5a6-59f2-46dd-99b3-e10ab1bbd27d_604x604.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-04-27T16:04:33.052Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7ac9c1ea-0a3f-40a9-86f4-bdba47a61df5_1920x1221.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://mlpills.substack.com/p/issue-56-retrieval-augmented-generation&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:144067322,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:12,&quot;comment_count&quot;:3,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Machine Learning Pills&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!YODk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dba4244-97d2-48f0-a2bb-b01c7ea74212_118x118.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h2><strong>LangFlow: Making AI Workflows Visual and Accessible</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Irsm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0660f5f0-61ef-41e7-8a42-b941af2b1c84_1362x402.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!Irsm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0660f5f0-61ef-41e7-8a42-b941af2b1c84_1362x402.png 424w, https://substackcdn.com/image/fetch/$s_!Irsm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0660f5f0-61ef-41e7-8a42-b941af2b1c84_1362x402.png 848w, https://substackcdn.com/image/fetch/$s_!Irsm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0660f5f0-61ef-41e7-8a42-b941af2b1c84_1362x402.png 1272w, https://substackcdn.com/image/fetch/$s_!Irsm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0660f5f0-61ef-41e7-8a42-b941af2b1c84_1362x402.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Irsm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0660f5f0-61ef-41e7-8a42-b941af2b1c84_1362x402.png" width="254" height="74.96916299559471" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0660f5f0-61ef-41e7-8a42-b941af2b1c84_1362x402.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:402,&quot;width&quot;:1362,&quot;resizeWidth&quot;:254,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Needle | The Knowledge Threading Platform&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Needle | The Knowledge Threading Platform" title="Needle | The Knowledge Threading Platform" 
srcset="https://substackcdn.com/image/fetch/$s_!Irsm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0660f5f0-61ef-41e7-8a42-b941af2b1c84_1362x402.png 424w, https://substackcdn.com/image/fetch/$s_!Irsm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0660f5f0-61ef-41e7-8a42-b941af2b1c84_1362x402.png 848w, https://substackcdn.com/image/fetch/$s_!Irsm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0660f5f0-61ef-41e7-8a42-b941af2b1c84_1362x402.png 1272w, https://substackcdn.com/image/fetch/$s_!Irsm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0660f5f0-61ef-41e7-8a42-b941af2b1c84_1362x402.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>LangFlow</strong> is a <strong>visual, no-code platform</strong> that transforms complex AI workflows into intuitive drag-and-drop interfaces. Think of it as the "Figma for AI systems"&#8212;instead of writing hundreds of lines of code to connect language models, databases, and APIs, you build sophisticated AI agents by connecting components visually. Each component represents a specific function (like loading documents, creating embeddings, or querying databases), and you connect them with simple lines to create the data flow.</p><p>What makes LangFlow particularly powerful for RAG systems is how it handles the complexity behind the scenes. Setting up a traditional RAG pipeline involves managing multiple APIs, handling data transformations, coordinating between different AI models, and ensuring everything works together seamlessly. 
<strong>LangFlow abstracts away this complexity while still giving you full control over each component's behavior and configuration.</strong></p><p><strong>For businesses</strong>, this means the <strong>difference between needing a team of AI engineers and being able to prototype and deploy intelligent agents with existing technical staff</strong>. The visual approach also makes it easier to understand, modify, and troubleshoot these systems&#8212;you can literally see how information flows from customer questions through document retrieval to final responses.</p><p><em>Now that we understand how LangFlow simplifies AI workflows, let&#8217;s walk through the architecture of a typical customer service agent built with this tool.</em></p><h2>The Architecture</h2><p>The components below show how the RAG concepts come together in a practical LangFlow implementation.</p><h3>1. Document Foundation</h3><p>Before an AI agent can answer questions intelligently, the information it draws on must first be <strong>organized for retrieval</strong>. Raw documents&#8212;whether FAQs, product manuals, or policies&#8212;aren&#8217;t optimized for retrieval. That&#8217;s where <em>chunking</em> comes in. It transforms unstructured or semi-structured content into digestible, context-aware units that balance <strong>retrieval accuracy, performance, and storage tradeoffs</strong>.</p><p><strong>Chunking</strong> is about more than size limits: the aim is to keep each unit meaningful on its own and relevant at retrieval time. 
Sophisticated approaches include:</p><ul><li><p><strong>Semantic Chunking</strong>: Split based on topic boundaries rather than character counts</p></li><li><p><strong>Hierarchical Chunking</strong>: Create chunks at multiple granularities (paragraph, section, document)</p></li><li><p><strong>Overlapping Windows</strong>: Sliding windows with strategic overlap to maintain context</p></li><li><p><strong>Metadata-Aware Chunking</strong>: Preserve document structure (headers, tables, lists) in chunk boundaries</p></li></ul><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;b1ad8bf8-98fc-4540-82f4-52f91a9b50f8&quot;,&quot;caption&quot;:&quot;&#128138; Pill of the Week&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Issue #63 - Text chunking for RAG systems&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:38707812,&quot;name&quot;:&quot;David Andr&#233;s&quot;,&quot;bio&quot;:&quot;&#128188; Data Scientist &#8226; &#128013; Python enthusiast&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db6423b2-36bc-440c-be7d-b54be5bad1b0_1447x1448.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null},{&quot;id&quot;:132707413,&quot;name&quot;:&quot;Josep Ferrer&quot;,&quot;bio&quot;:&quot;Outstand using data -- Data Science, Design and Tech Tech Writer @KDnuggets @DataCamp &#128073;&#127995;Inquiries in 
rfeers@gmail.com&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd196b5a6-59f2-46dd-99b3-e10ab1bbd27d_604x604.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-06-29T17:34:54.776Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4daf8dd2-ef96-4984-8b0f-68ed0ff903c6_1920x1280.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://mlpills.substack.com/p/issue-64-text-chunking-for-rag-systems&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:146103471,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:7,&quot;comment_count&quot;:2,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Machine Learning Pills&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!YODk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dba4244-97d2-48f0-a2bb-b01c7ea74212_118x118.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>In <strong>LangFlow</strong>, this begins with a File Loader component connected to a Text Splitter. The configuration typically uses 1,000 characters with 200-character overlap, balancing context preservation with retrieval precision. 
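</p><p>The 1,000-character window with a 200-character overlap can be sketched in a few lines of plain Python. This is a simplified illustration of the sliding-window idea, not LangFlow's actual Text Splitter, which may also split on separators; the function name <code>chunk_text</code> and the sample string are hypothetical.</p>

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into character windows of chunk_size that share
    `overlap` characters with the previous window."""
    if overlap not in range(chunk_size):
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # distance between consecutive window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Stand-in for a loaded document (hypothetical 2,500-character string)
document = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(document)
print([len(c) for c in chunks])  # prints: [1000, 1000, 900]
```

<p>Each chunk starts 800 characters after the previous one, so neighbouring chunks share 200 characters and a short sentence cut at one boundary is likely to appear whole in the adjacent chunk.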
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UVRp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F996ffb5a-fc9e-426e-8401-2181a16dd9e2_1307x731.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UVRp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F996ffb5a-fc9e-426e-8401-2181a16dd9e2_1307x731.png 424w, https://substackcdn.com/image/fetch/$s_!UVRp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F996ffb5a-fc9e-426e-8401-2181a16dd9e2_1307x731.png 848w, https://substackcdn.com/image/fetch/$s_!UVRp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F996ffb5a-fc9e-426e-8401-2181a16dd9e2_1307x731.png 1272w, https://substackcdn.com/image/fetch/$s_!UVRp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F996ffb5a-fc9e-426e-8401-2181a16dd9e2_1307x731.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UVRp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F996ffb5a-fc9e-426e-8401-2181a16dd9e2_1307x731.png" width="644" height="360.1866870696251" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/996ffb5a-fc9e-426e-8401-2181a16dd9e2_1307x731.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:731,&quot;width&quot;:1307,&quot;resizeWidth&quot;:644,&quot;bytes&quot;:80498,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/168157534?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F996ffb5a-fc9e-426e-8401-2181a16dd9e2_1307x731.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UVRp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F996ffb5a-fc9e-426e-8401-2181a16dd9e2_1307x731.png 424w, https://substackcdn.com/image/fetch/$s_!UVRp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F996ffb5a-fc9e-426e-8401-2181a16dd9e2_1307x731.png 848w, https://substackcdn.com/image/fetch/$s_!UVRp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F996ffb5a-fc9e-426e-8401-2181a16dd9e2_1307x731.png 1272w, https://substackcdn.com/image/fetch/$s_!UVRp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F996ffb5a-fc9e-426e-8401-2181a16dd9e2_1307x731.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>In this example we load two files: general company information and the FAQs from the company's website. Both contain information a customer is likely to ask about when contacting a customer service bot.</p><p><strong>Business Reality</strong>: A company's knowledge base includes FAQs, product manuals, policy documents, and troubleshooting guides. These documents vary in length and complexity&#8212;from short FAQ answers to comprehensive product specifications.</p><p><strong>Performance Considerations</strong>:</p><ul><li><p><strong>Chunk Size vs. Precision</strong>: Smaller chunks increase precision but may lose context</p></li><li><p><strong>Overlap vs. 
Storage</strong>: More overlap improves recall but increases storage requirements</p></li><li><p><strong>Retrieval Granularity</strong>: Match chunk size to expected query complexity</p></li></ul><h3>2. Semantic Understanding</h3><p><strong>What Are Embeddings?</strong> Embeddings are how AI models represent meaning using numbers. Instead of just seeing words as symbols, embeddings let models understand how words, phrases, or documents relate to each other.</p><p>When a model processes a sentence, it turns it into a vector&#8212;a list of numbers that capture its meaning. Sentences with similar meanings get vectors that are close together. That's why embeddings are so useful in things like search, clustering, and recommendation&#8212;they let the AI measure semantic similarity.</p><p>Modern embeddings are contextual, which means the meaning of a word changes depending on where it appears&#8212;just like how "bank" means something different near "river" than it does near "money." And as models advance, embeddings can even represent text, images, and code in the same shared space, enabling powerful cross-modal applications.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;156215cd-9fef-42aa-b7ce-255b685deca8&quot;,&quot;caption&quot;:&quot;&#128138; Pill of the week&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Issue #58 - Embeddings in NLP&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:38707812,&quot;name&quot;:&quot;David Andr&#233;s&quot;,&quot;bio&quot;:&quot;&#128188; Data Scientist &#8226; &#128013; Python enthusiast&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db6423b2-36bc-440c-be7d-b54be5bad1b0_1447x1448.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null},{&quot;id&quot;:132707413,&quot;name&quot;:&quot;Josep 
Ferrer&quot;,&quot;bio&quot;:&quot;Outstand using data -- Data Science, Design and Tech Tech Writer @KDnuggets @DataCamp &#128073;&#127995;Inquiries in rfeers@gmail.com&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd196b5a6-59f2-46dd-99b3-e10ab1bbd27d_604x604.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-05-18T15:53:41.878Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97594853-b1e7-4ca4-b941-1f42d4d13227_1920x1280.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://mlpills.substack.com/p/issue-58-embeddings-in-nlp&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:144745469,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:13,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Machine Learning Pills&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!YODk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dba4244-97d2-48f0-a2bb-b01c7ea74212_118x118.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p><strong>In LangFlow</strong>, document chunks feed into an Embedding Model component using OpenAI's <code>text-embedding-3-small</code> or similar. 
Each chunk transforms into a dense vector&#8212;a numerical representation capturing semantic meaning.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v6Bu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e090180-36df-4039-a158-ec8721c5fdd6_611x681.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v6Bu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e090180-36df-4039-a158-ec8721c5fdd6_611x681.png 424w, https://substackcdn.com/image/fetch/$s_!v6Bu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e090180-36df-4039-a158-ec8721c5fdd6_611x681.png 848w, https://substackcdn.com/image/fetch/$s_!v6Bu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e090180-36df-4039-a158-ec8721c5fdd6_611x681.png 1272w, https://substackcdn.com/image/fetch/$s_!v6Bu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e090180-36df-4039-a158-ec8721c5fdd6_611x681.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v6Bu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e090180-36df-4039-a158-ec8721c5fdd6_611x681.png" width="325" height="362.2340425531915" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1e090180-36df-4039-a158-ec8721c5fdd6_611x681.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:681,&quot;width&quot;:611,&quot;resizeWidth&quot;:325,&quot;bytes&quot;:42390,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/168157534?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e090180-36df-4039-a158-ec8721c5fdd6_611x681.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v6Bu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e090180-36df-4039-a158-ec8721c5fdd6_611x681.png 424w, https://substackcdn.com/image/fetch/$s_!v6Bu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e090180-36df-4039-a158-ec8721c5fdd6_611x681.png 848w, https://substackcdn.com/image/fetch/$s_!v6Bu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e090180-36df-4039-a158-ec8721c5fdd6_611x681.png 1272w, https://substackcdn.com/image/fetch/$s_!v6Bu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e090180-36df-4039-a158-ec8721c5fdd6_611x681.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><strong>Real-World Impact</strong>: When customers ask about "product returns," "item refunds," or "sending back purchases," the embedding model maps these phrases to nearby vectors, so they are treated as semantically related. The agent finds relevant information based on meaning, not just keyword matching.</p><h3>3. Knowledge Storage</h3><p><strong>What is a Vector Database?</strong> A vector database is built for a different kind of search: semantic search. Instead of matching keywords, it finds content with a similar meaning, even when the wording differs. To do this, it stores information as vectors, long lists of numbers that capture the meaning of text, images, or other content.</p><p>For example, a sentence like "I love chocolate" might be turned into a 1,536-dimensional vector by an embedding model (1,536 is the default output size of OpenAI's <code>text-embedding-3-small</code>). 
Then, when you search for "favorite desserts," the database can find documents with similar meanings, even if the words don't exactly match.</p><p>Behind the scenes, it uses special data structures&#8212;like HNSW graphs&#8212;to find the closest vectors quickly, even in massive datasets. And to compare similarity, it relies on mathematical techniques like cosine similarity or Euclidean distance. This kind of search is at the heart of recommendation systems, chatbots, and any AI tool that needs to understand what content is about, not just how it's phrased.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;5820789a-6694-43a6-a315-94f0add3e02c&quot;,&quot;caption&quot;:&quot;Today we are introducing a new type of issue: &#8220;Podcast notes&#8221; &#127881;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Issue #55 - Vector Databases and their importance&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:38707812,&quot;name&quot;:&quot;David Andr&#233;s&quot;,&quot;bio&quot;:&quot;&#128188; Data Scientist &#8226; &#128013; Python enthusiast&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db6423b2-36bc-440c-be7d-b54be5bad1b0_1447x1448.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null},{&quot;id&quot;:132707413,&quot;name&quot;:&quot;Josep Ferrer&quot;,&quot;bio&quot;:&quot;Outstand using data -- Data Science, Design and Tech Tech Writer @KDnuggets @DataCamp &#128073;&#127995;Inquiries in 
rfeers@gmail.com&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd196b5a6-59f2-46dd-99b3-e10ab1bbd27d_604x604.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-04-20T11:31:02.082Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6649b34a-d70f-49d5-8347-f225fef8910a_1920x1280.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://mlpills.substack.com/p/issue-55-vector-databases-and-their&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:143749725,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:10,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Machine Learning Pills&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!YODk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dba4244-97d2-48f0-a2bb-b01c7ea74212_118x118.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p><strong>In LangFlow</strong>, ChromaDB is configured with a collection name and persistent storage directory. 
This creates the agent's knowledge repository: because it is persisted to disk, it survives system restarts and can grow as new documents are added.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CSPG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc0727fd-790c-46c0-b927-e68b166cd48e_573x832.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CSPG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc0727fd-790c-46c0-b927-e68b166cd48e_573x832.png 424w, https://substackcdn.com/image/fetch/$s_!CSPG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc0727fd-790c-46c0-b927-e68b166cd48e_573x832.png 848w, https://substackcdn.com/image/fetch/$s_!CSPG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc0727fd-790c-46c0-b927-e68b166cd48e_573x832.png 1272w, https://substackcdn.com/image/fetch/$s_!CSPG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc0727fd-790c-46c0-b927-e68b166cd48e_573x832.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CSPG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc0727fd-790c-46c0-b927-e68b166cd48e_573x832.png" width="325" height="471.90226876090753" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cc0727fd-790c-46c0-b927-e68b166cd48e_573x832.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:573,&quot;resizeWidth&quot;:325,&quot;bytes&quot;:44886,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/168157534?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc0727fd-790c-46c0-b927-e68b166cd48e_573x832.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CSPG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc0727fd-790c-46c0-b927-e68b166cd48e_573x832.png 424w, https://substackcdn.com/image/fetch/$s_!CSPG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc0727fd-790c-46c0-b927-e68b166cd48e_573x832.png 848w, https://substackcdn.com/image/fetch/$s_!CSPG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc0727fd-790c-46c0-b927-e68b166cd48e_573x832.png 1272w, https://substackcdn.com/image/fetch/$s_!CSPG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc0727fd-790c-46c0-b927-e68b166cd48e_573x832.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><strong>Why did we choose ChromaDB?</strong> It suits this use case well: it can run <strong>locally</strong>, which keeps data in-house and costs predictable; it requires <strong>minimal setup</strong>, with no cluster to manage; it supports <strong>persistent storage</strong> and multiple similarity metrics; and it offers metadata filtering and collection management for flexible retrieval workflows.</p><h3>4. Tool Integration</h3><p>Storing knowledge is only part of the equation&#8212;making it actionable is the next step.</p><p><strong>What Are Tools in AI Systems?</strong> In the world of AI agents, tools are like superpowers. 
While a language model on its own can generate and understand text, tools let it interact with the world&#8212;run calculations, look up real-time info, call APIs, query a database, and more.</p><p>Each tool is a function with a clear structure: it has inputs, outputs, and a short explanation so the AI knows when to use it. When the agent needs to solve a problem, it can choose the right tool, use it, look at the results, and continue from there. This unlocks a whole new level of usefulness.</p><p>Advanced AI agents can even chain tools together to perform complex tasks, learn about new tools at runtime, or check if a tool call is valid before using it. It's a way of extending the model's reach&#8212;connecting reasoning with real-world capabilities.</p><p><strong>In LangFlow</strong>, the critical step involves enabling "Tool Mode" on the ChromaDB component. This transforms the database from a static resource into a dynamic tool the agent can call, creating a discrete function called <code>SEARCH_DOCUMENTS</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3FiL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bf5a15b-8215-413e-9040-ebc02fb598df_523x715.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3FiL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bf5a15b-8215-413e-9040-ebc02fb598df_523x715.png 424w, https://substackcdn.com/image/fetch/$s_!3FiL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bf5a15b-8215-413e-9040-ebc02fb598df_523x715.png 848w, 
https://substackcdn.com/image/fetch/$s_!3FiL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bf5a15b-8215-413e-9040-ebc02fb598df_523x715.png 1272w, https://substackcdn.com/image/fetch/$s_!3FiL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bf5a15b-8215-413e-9040-ebc02fb598df_523x715.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3FiL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bf5a15b-8215-413e-9040-ebc02fb598df_523x715.png" width="373" height="509.93307839388143" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3bf5a15b-8215-413e-9040-ebc02fb598df_523x715.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:715,&quot;width&quot;:523,&quot;resizeWidth&quot;:373,&quot;bytes&quot;:47476,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/168157534?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bf5a15b-8215-413e-9040-ebc02fb598df_523x715.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3FiL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bf5a15b-8215-413e-9040-ebc02fb598df_523x715.png 424w, 
https://substackcdn.com/image/fetch/$s_!3FiL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bf5a15b-8215-413e-9040-ebc02fb598df_523x715.png 848w, https://substackcdn.com/image/fetch/$s_!3FiL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bf5a15b-8215-413e-9040-ebc02fb598df_523x715.png 1272w, https://substackcdn.com/image/fetch/$s_!3FiL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bf5a15b-8215-413e-9040-ebc02fb598df_523x715.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<blockquote><p><strong>Why This Architecture Matters</strong>: Instead of always searching or never searching, the agent now decides when to use the tool, what parameters to pass, and how to interpret the results.</p></blockquote><h3>5. The Reasoning Engine</h3><p><strong>What is a ReAct Agent?</strong> A ReAct agent is a type of AI that doesn't just generate answers&#8212;it thinks through problems and takes actions to solve them. The name stands for "<strong>Reasoning + Acting</strong>", and that's exactly how it works: the agent reasons about the situation, takes an action (like calling a tool or API), looks at the result, and then decides what to do next. This loop continues until it completes the task.</p><p>What makes ReAct agents powerful is their ability to adapt on the fly. They don't just follow a fixed script; they adjust each step based on what the previous one returned. If a step fails, they can recognize it and try a different approach. They can even reflect on their own reasoning and revise it when needed. This makes them especially good at handling complex tasks that require multiple steps or tools, and they're more transparent too&#8212;you can follow their train of thought as they work.</p><p>The Agent component in LangFlow uses a large language model (like GPT-4) with carefully crafted instructions: "Act as a helpful customer service representative. 
Use the SEARCH_DOCUMENTS tool when customers ask about specific products, policies, or procedures."</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kraF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca506332-e679-41fd-9773-89358db89a57_856x697.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kraF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca506332-e679-41fd-9773-89358db89a57_856x697.png 424w, https://substackcdn.com/image/fetch/$s_!kraF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca506332-e679-41fd-9773-89358db89a57_856x697.png 848w, https://substackcdn.com/image/fetch/$s_!kraF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca506332-e679-41fd-9773-89358db89a57_856x697.png 1272w, https://substackcdn.com/image/fetch/$s_!kraF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca506332-e679-41fd-9773-89358db89a57_856x697.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kraF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca506332-e679-41fd-9773-89358db89a57_856x697.png" width="856" height="697" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ca506332-e679-41fd-9773-89358db89a57_856x697.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:697,&quot;width&quot;:856,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:67853,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/168157534?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca506332-e679-41fd-9773-89358db89a57_856x697.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kraF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca506332-e679-41fd-9773-89358db89a57_856x697.png 424w, https://substackcdn.com/image/fetch/$s_!kraF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca506332-e679-41fd-9773-89358db89a57_856x697.png 848w, https://substackcdn.com/image/fetch/$s_!kraF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca506332-e679-41fd-9773-89358db89a57_856x697.png 1272w, https://substackcdn.com/image/fetch/$s_!kraF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca506332-e679-41fd-9773-89358db89a57_856x697.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><strong>The Decision Process</strong>: For each customer query, the agent follows this pattern:</p><ol><li><p><strong>Analyze</strong>: What type of question is this?</p></li><li><p><strong>Decide</strong>: Do I need to search company knowledge?</p></li><li><p><strong>Act</strong>: If so, call the search tool with relevant terms</p></li><li><p><strong>Synthesize</strong>: Combine the retrieved information with a clear explanation</p></li><li><p><strong>Respond</strong>: Provide accurate, helpful customer service</p></li></ol><p><strong>Advanced Reasoning</strong>: The agent can chain multiple searches, refine its queries, and synthesize information from different sources, for example searching for product specifications, then pricing information, then warranty details to answer one comprehensive customer question.</p><h3>6. 
Customer Experience</h3><p><strong>The Components</strong>: Chat Input and Chat Output create the conversational interface, but the real sophistication lies in how the system manages the entire interaction flow.</p><p><strong>The Result</strong>: Customers interact with what feels like the company's most knowledgeable service representative&#8212;someone who has instant access to every piece of company information but only references what's relevant to their specific need.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!03Da!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F355e114c-99a5-468d-b8c0-59d178830a06_1321x781.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!03Da!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F355e114c-99a5-468d-b8c0-59d178830a06_1321x781.png 424w, https://substackcdn.com/image/fetch/$s_!03Da!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F355e114c-99a5-468d-b8c0-59d178830a06_1321x781.png 848w, https://substackcdn.com/image/fetch/$s_!03Da!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F355e114c-99a5-468d-b8c0-59d178830a06_1321x781.png 1272w, https://substackcdn.com/image/fetch/$s_!03Da!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F355e114c-99a5-468d-b8c0-59d178830a06_1321x781.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!03Da!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F355e114c-99a5-468d-b8c0-59d178830a06_1321x781.png" width="1321" height="781" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/355e114c-99a5-468d-b8c0-59d178830a06_1321x781.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:781,&quot;width&quot;:1321,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:202453,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/168157534?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb639ef0-0880-4d82-90c7-b06e90697a85_1339x781.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!03Da!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F355e114c-99a5-468d-b8c0-59d178830a06_1321x781.png 424w, https://substackcdn.com/image/fetch/$s_!03Da!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F355e114c-99a5-468d-b8c0-59d178830a06_1321x781.png 848w, https://substackcdn.com/image/fetch/$s_!03Da!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F355e114c-99a5-468d-b8c0-59d178830a06_1321x781.png 1272w, https://substackcdn.com/image/fetch/$s_!03Da!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F355e114c-99a5-468d-b8c0-59d178830a06_1321x781.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>We load company information and split it into chunks, then store them in a vector database. To enable efficient search, we use an embedding model connected to the vector database to convert text into numerical representations that capture meaning. We convert the vector database search into a tool so the LLM can use it to find relevant information when answering customer questions. 
Finally, we connect this search tool to the LLM along with chat input and output components, creating an intelligent customer service agent that can access company knowledge on-demand to provide accurate, helpful responses.</p><blockquote><p><strong>LangFlow is great for rapid prototyping,</strong> offering a visual and intuitive interface. However, it can <strong>limit the flexibility</strong> required for building more complex chatbots and advanced agent workflows&#8212;something <strong>LangGraph</strong> handles more effectively with its <strong>greater control and modularity</strong>. You can find an example of how to build a ReAct agent in LangGraph in the following previous issue:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;ec165c96-3f0d-4311-867c-ae689dcb69d1&quot;,&quot;caption&quot;:&quot;&#128138; Pill of the Week&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;DIY #14 - Step-by-step implementation of a ReAct Agent in LangGraph&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:38707812,&quot;name&quot;:&quot;David Andr&#233;s&quot;,&quot;bio&quot;:&quot;&#128188; Data Scientist &#8226; &#128013; Python 
enthusiast&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db6423b2-36bc-440c-be7d-b54be5bad1b0_1447x1448.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-04-20T12:37:39.926Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!du7b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06f2cf46-df40-48f9-a798-931222b0f70a_590x592.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://mlpills.substack.com/p/diy-14-step-by-step-implementation&quot;,&quot;section_name&quot;:&quot;DIY&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:161105148,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:9,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Machine Learning Pills&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!YODk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dba4244-97d2-48f0-a2bb-b01c7ea74212_118x118.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></blockquote><div><hr></div><h1>&#127891; Two Paths to Mastering AI*</h1><h3>&#128104;&#8205;&#128187; <strong>Build and Launch Your Own AI Product</strong></h3><p><strong>Full-Stack LLM Developer Certification</strong><br>Become one of the first certified LLM developers and gain job-ready skills:</p><ul><li><p>Build a <strong>real-world AI product</strong> (like a RAG-powered tutor)</p></li><li><p>Learn Prompting, RAG, Fine-tuning, LLM Agents &amp; Deployment</p></li><li><p>Create a <strong>standout portfolio</strong> + walk away with <strong>certification</strong></p></li><li><p>Ideal for developers, data scientists &amp; tech 
professionals</p></li></ul><blockquote><p>&#127919; Finish with a project you can pitch, deploy, or get hired for<br>&#128736;&#65039; 90+ hands-on lessons | Slack support | Weekly updates<br>&#128188; Secure your place in a high-demand, high-impact GenAI role</p></blockquote><p>&#128279; <a href="https://academy.towardsai.net/courses/beginner-to-advanced-llm-dev?ref=3b122f">Join the August Cohort &#8211; Start Building</a></p><p></p><h3>&#128188; <strong>Lead the AI Shift in Your Industry</strong></h3><p><strong>AI for Business Professionals</strong><br>No code needed. Learn how to work <em>smarter</em> with AI in your role:</p><ul><li><p>Save <strong>10+ hours/week</strong> with smart prompting &amp; workflows</p></li><li><p>Use tools like ChatGPT, Claude, Gemini, and Perplexity</p></li><li><p>Lead AI adoption and innovation in your team or company</p></li><li><p>Includes role-specific modules (Sales, HR, Product, etc.)</p></li></ul><blockquote><p>&#9989; Build an &#8220;AI-first&#8221; mindset &amp; transform the way you work<br>&#128269; Learn to implement real AI use cases across your org<br>&#129504; Perfect for managers, consultants, and business leaders</p></blockquote><p>&#128279; <a href="https://academy.towardsai.net/courses/AI-for-Business-Professionals-course?ref=3b122f">Start Working Smarter with AI Today</a></p><p></p><p>&#127891; <strong>Both Courses Come With:</strong></p><ul><li><p>Certification</p></li><li><p>Lifetime Access &amp; Weekly Updates</p></li><li><p>30-Day Money-Back Guarantee</p></li><li><p>Hands-On Projects, Not Just Theory</p></li></ul><p>&#128197; <strong>August Cohorts Now Open &#8212; Don&#8217;t Wait</strong><br>&#128293; Invest in the skillset that will define the next decade.</p><p><em>*Sponsored: by purchasing any of their courses you would also be supporting MLPills.</em></p><div><hr></div><h2>Real-World Customer Service Scenario</h2><p>Let's examine how this system handles typical customer service 
interactions:</p><p><strong>Customer</strong>: "What's the return policy for electronics?"</p><p><strong>Agent's Internal Process</strong>:</p><ol><li><p><em>Reasoning</em>: "This is a specific policy question requiring company knowledge"</p></li><li><p><em>Action</em>: Calls <code>SEARCH_DOCUMENTS</code> with parameters targeting electronics return policies</p></li><li><p><em>Observation</em>: Receives relevant policy chunks with metadata</p></li><li><p><em>Synthesis</em>: Combines policy details with a clear, customer-friendly explanation</p></li><li><p><em>Response</em>: "For electronics, the return policy allows returns within 30 days of purchase with original packaging. Here are the specific steps..."</p></li></ol><p><strong>Customer</strong>: "My wireless headphones won't charge properly"</p><p><strong>Agent's Internal Process</strong>:</p><ol><li><p><em>Reasoning</em>: "This is a technical issue requiring troubleshooting knowledge"</p></li><li><p><em>Action</em>: Calls <code>SEARCH_DOCUMENTS</code> with "wireless headphones charging problems"</p></li><li><p><em>Observation</em>: Retrieves relevant troubleshooting procedures</p></li><li><p><em>Multi-step Planning</em>: Identifies a sequence of diagnostic steps</p></li><li><p><em>Response</em>: "I can help you with that charging issue. Let's try these troubleshooting steps in order..."</p></li></ol><p><strong>Customer</strong>: "Hi there, how are you?"</p><p><strong>Agent's Internal Process</strong>:</p><ol><li><p><em>Reasoning</em>: "This is a social greeting, so no company knowledge is needed"</p></li><li><p><em>Direct Response</em>: "Hello! I'm doing well, thank you for asking. How can I help you today?"</p></li></ol><p>Notice how the agent adapts its behavior to each customer's needs: it searches company knowledge only when the question calls for it and answers directly otherwise, which is the core of the ReAct architecture.</p><h2>The Business Impact: Why This Matters</h2>
      <p>
          <a href="https://mlpills.substack.com/p/rw-5-no-code-customer-service-agent">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[RW #4 - EDA applied to Netflix (part II)]]></title><description><![CDATA[Welcome back! In Part I, we laid the groundwork for exploring Netflix&#8217;s data&#8212;distributions, ratings, and top producing countries. Now, in Part II, we&#8217;re diving deeper into specific questions you can answer with this dataset. These insights further illustrate how each EDA discovery can supercharge a machine learning project.]]></description><link>https://mlpills.substack.com/p/rw-4-eda-applied-to-netflix-part</link><guid isPermaLink="false">https://mlpills.substack.com/p/rw-4-eda-applied-to-netflix-part</guid><dc:creator><![CDATA[Muhammad Anas]]></dc:creator><pubDate>Sun, 13 Apr 2025 19:19:09 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e641cd81-6b62-4d22-95e3-758822b7f2b4_1024x966.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>&#128138; <strong>Pill of the Week</strong></h1><p>Welcome back! In <a href="https://mlpills.substack.com/p/rw-3-eda-applied-to-netflix-part">Part I</a>, we laid the groundwork for exploring Netflix&#8217;s data&#8212;distributions, ratings, and top producing countries. Now, in <strong>Part II</strong>, we&#8217;re diving deeper into specific questions you can answer with this dataset. These insights further illustrate how each EDA discovery can supercharge a machine learning project.</p><blockquote><p>&#9999;&#65039; Article and code by Muhammad Anas.</p></blockquote><p>Do you want a <strong>reminder</strong>? 
Check <strong>part I</strong> before moving on:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;41449048-5ea2-40df-8ae6-fcf9106d686d&quot;,&quot;caption&quot;:&quot;&#128138; Pill of the Week&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;RW #3 - EDA applied to Netflix (part I)&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:38707812,&quot;name&quot;:&quot;David Andr&#233;s&quot;,&quot;bio&quot;:&quot;&#128188; Data Scientist &#8226; &#128013; Python enthusiast&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db6423b2-36bc-440c-be7d-b54be5bad1b0_1447x1448.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null},{&quot;id&quot;:236084597,&quot;name&quot;:&quot;Muhammad Anas&quot;,&quot;bio&quot;:&quot;random drops multiple times a week to keep you up to date with the field&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68943441-be86-431c-9ca5-876167a3ab9e_3024x4032.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-03-30T06:30:57.235Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1925bf0f-1040-449a-98b6-259d82b46c10_1536x1024.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://mlpills.substack.com/p/rw-3-eda-applied-to-netflix-part&quot;,&quot;section_name&quot;:&quot;Real-World&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:159603757,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:13,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Machine Learning 
Pills&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dba4244-97d2-48f0-a2bb-b01c7ea74212_118x118.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h2>What will be covered in this part?</h2><p>We will continue exploring EDA techniques by using the Netflix example. Here is a summary of what will be covered in this issue:</p><ul><li><p>Genre Breakdown (Movies vs. TV Shows)</p></li><li><p>Top 10 Actors on Netflix</p></li><li><p>Movies vs. TV Shows Over Time</p></li><li><p>Best Time to Release on Netflix</p></li><li><p>TV Shows with the Most Seasons</p></li></ul><div class="pullquote"><p><em>&#128142;<strong>Next</strong> <strong>Wednesday</strong> we will send to all <strong>paid subscribers</strong> the <strong>full notebook</strong>, including all the code. This is a <strong>one-time send</strong>&#8212;only subscribers with an active paid membership at that time will receive it via email.&#128142;</em></p></div><h2>6. Genre Breakdown: Movies vs. 
TV Shows</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DeTc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F935f2dee-efe5-4c85-adcf-3f9362681253_1216x490.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DeTc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F935f2dee-efe5-4c85-adcf-3f9362681253_1216x490.png 424w, https://substackcdn.com/image/fetch/$s_!DeTc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F935f2dee-efe5-4c85-adcf-3f9362681253_1216x490.png 848w, https://substackcdn.com/image/fetch/$s_!DeTc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F935f2dee-efe5-4c85-adcf-3f9362681253_1216x490.png 1272w, https://substackcdn.com/image/fetch/$s_!DeTc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F935f2dee-efe5-4c85-adcf-3f9362681253_1216x490.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DeTc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F935f2dee-efe5-4c85-adcf-3f9362681253_1216x490.png" width="728" height="293.35526315789474" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/935f2dee-efe5-4c85-adcf-3f9362681253_1216x490.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:490,&quot;width&quot;:1216,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!DeTc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F935f2dee-efe5-4c85-adcf-3f9362681253_1216x490.png 424w, https://substackcdn.com/image/fetch/$s_!DeTc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F935f2dee-efe5-4c85-adcf-3f9362681253_1216x490.png 848w, https://substackcdn.com/image/fetch/$s_!DeTc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F935f2dee-efe5-4c85-adcf-3f9362681253_1216x490.png 1272w, https://substackcdn.com/image/fetch/$s_!DeTc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F935f2dee-efe5-4c85-adcf-3f9362681253_1216x490.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Why Care?</strong></p><ul><li><p>Understanding what genres dominate each format (Movies vs. TV Shows) helps with personalization. For instance, a user who loves <em>Romantic TV Shows</em> but avoids <em>Action Movies</em> might prefer different content recommendations.</p></li><li><p>For a production strategy, Netflix might spot gaps in certain genres.</p></li></ul><p><strong>Finding:</strong></p><ul><li><p>Movies and TV shows have overlapping genres (e.g., Drama, Comedy), but the top 10 for each format highlights distinct user interests.</p></li><li><p>The bar plots reveal which 10 genres are most popular in each category. For example, you might see that <strong>Drama</strong> is #1 for Movies, while <strong>International TV</strong> ranks highly for TV Shows.</p></li></ul><p><strong>ML Angle:</strong></p><ul><li><p><strong>Genre-based embeddings</strong>: Each genre can become a feature in a recommendation model. 
If a user frequently watches a certain genre, your algorithm can boost those titles.</p></li><li><p><strong>Content clustering</strong>: Grouping titles by genre could help you tailor personalized recommendations or identify underserved niches.</p></li></ul><div class="pullquote"><p><em>&#128142; Here&#8217;s a snippet of the code. The <strong>full notebook</strong>, including all the code, will be sent <strong>exclusively to paid subscribers on Wednesday</strong>. This is a <strong>one-time send</strong>: only subscribers with an active paid membership at that time will receive it via email.&#128142;</em></p></div><pre><code><code># Top 10 Genres for Movies
netflix_df[netflix_df["type"]=="Movie"]["genre"] \
    .value_counts()[:10] \
    .plot(kind='barh', color=colors)

# Top 10 Genres for TV Shows (run in a separate cell so the two charts don't share axes)
netflix_df[netflix_df["type"]=="TV Show"]["genre"] \
    .value_counts()[:10] \
    .plot(kind='barh', color=colors)</code></code></pre><p><strong>Business Take:</strong></p><ul><li><p>If certain genres are strong in TV Shows but weaker in Movies (or vice versa), Netflix might invest more in those unbalanced areas to broaden appeal or double down on existing strengths.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rS_C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eccf9a0-47bc-497a-93ba-35747b1c5237_1268x490.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rS_C!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eccf9a0-47bc-497a-93ba-35747b1c5237_1268x490.png 424w, https://substackcdn.com/image/fetch/$s_!rS_C!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eccf9a0-47bc-497a-93ba-35747b1c5237_1268x490.png 848w, https://substackcdn.com/image/fetch/$s_!rS_C!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eccf9a0-47bc-497a-93ba-35747b1c5237_1268x490.png 1272w, https://substackcdn.com/image/fetch/$s_!rS_C!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eccf9a0-47bc-497a-93ba-35747b1c5237_1268x490.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rS_C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eccf9a0-47bc-497a-93ba-35747b1c5237_1268x490.png" width="1268" height="490" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6eccf9a0-47bc-497a-93ba-35747b1c5237_1268x490.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:490,&quot;width&quot;:1268,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rS_C!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eccf9a0-47bc-497a-93ba-35747b1c5237_1268x490.png 424w, https://substackcdn.com/image/fetch/$s_!rS_C!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eccf9a0-47bc-497a-93ba-35747b1c5237_1268x490.png 848w, https://substackcdn.com/image/fetch/$s_!rS_C!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eccf9a0-47bc-497a-93ba-35747b1c5237_1268x490.png 1272w, https://substackcdn.com/image/fetch/$s_!rS_C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eccf9a0-47bc-497a-93ba-35747b1c5237_1268x490.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li></ul><h2>7. Top 10 Actors on Netflix</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i6LJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaa67fc5-aa69-4a14-9217-bea1715626c5_1122x541.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i6LJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaa67fc5-aa69-4a14-9217-bea1715626c5_1122x541.png 424w, https://substackcdn.com/image/fetch/$s_!i6LJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaa67fc5-aa69-4a14-9217-bea1715626c5_1122x541.png 848w, 
https://substackcdn.com/image/fetch/$s_!i6LJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaa67fc5-aa69-4a14-9217-bea1715626c5_1122x541.png 1272w, https://substackcdn.com/image/fetch/$s_!i6LJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaa67fc5-aa69-4a14-9217-bea1715626c5_1122x541.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i6LJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaa67fc5-aa69-4a14-9217-bea1715626c5_1122x541.png" width="1122" height="541" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/daa67fc5-aa69-4a14-9217-bea1715626c5_1122x541.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:541,&quot;width&quot;:1122,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:34332,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/161218333?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaa67fc5-aa69-4a14-9217-bea1715626c5_1122x541.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!i6LJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaa67fc5-aa69-4a14-9217-bea1715626c5_1122x541.png 424w, 
https://substackcdn.com/image/fetch/$s_!i6LJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaa67fc5-aa69-4a14-9217-bea1715626c5_1122x541.png 848w, https://substackcdn.com/image/fetch/$s_!i6LJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaa67fc5-aa69-4a14-9217-bea1715626c5_1122x541.png 1272w, https://substackcdn.com/image/fetch/$s_!i6LJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdaa67fc5-aa69-4a14-9217-bea1715626c5_1122x541.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Why Care?</strong></p><ul><li><p>Star power can drive viewer engagement and subscriptions. Think about how big names (e.g., Adam Sandler or Shah Rukh Khan) can bring massive audiences.</p></li><li><p>Understanding who appears most often may inform how Netflix negotiates or markets content.</p></li></ul><p><strong>Finding:</strong></p><ul><li><p>By splitting the <code>cast</code> column (handling missing values), we see which actors show up most across the catalog. The top 10 often includes frequent collaborators or multi-title deals (especially with Netflix originals).</p></li></ul><p><strong>ML Angle:</strong></p><ul><li><p><strong>Actor-based recommendations</strong>: If a user is a huge fan of Actor X, your model can surface more content starring Actor X.</p></li><li><p><strong>Popularity metric</strong>: Actors who appear more frequently could weigh heavily in user engagement predictions.</p></li></ul><pre><code><code># Create a DataFrame of actors
netflix_df['cast'] = netflix_df['cast'].fillna('No Cast Specified')
# Split on commas, stack to one actor per row, and strip the whitespace left after each comma
filtered_cast = netflix_df['cast'].str.split(',', expand=True).stack().str.strip().to_frame()
filtered_cast.columns = ['Actor']
actors = filtered_cast.groupby(['Actor']).size().reset_index(name='Total Content')
actors = actors[actors.Actor != 'No Cast Specified']

# Sort by count first, then keep the 10 most frequent actors
top_actors = actors.sort_values(by='Total Content', ascending=False).head(10)
x = top_actors["Actor"]
y = top_actors["Total Content"]

sns.barplot(x=x, y=y)
</code></code></pre><p><strong>Business Take:</strong></p><ul><li><p>Actors with a high volume of content on Netflix can be central to marketing campaigns or further deals. Netflix might also spot up-and-coming talents who appear in multiple successful titles.</p></li></ul><div><hr></div><h2>8. 
Movies vs. TV Shows Over Time</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eSDG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf86d01-5165-44ab-8543-8e82748ef539_1006x563.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eSDG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf86d01-5165-44ab-8543-8e82748ef539_1006x563.png 424w, https://substackcdn.com/image/fetch/$s_!eSDG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf86d01-5165-44ab-8543-8e82748ef539_1006x563.png 848w, https://substackcdn.com/image/fetch/$s_!eSDG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf86d01-5165-44ab-8543-8e82748ef539_1006x563.png 1272w, https://substackcdn.com/image/fetch/$s_!eSDG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf86d01-5165-44ab-8543-8e82748ef539_1006x563.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eSDG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf86d01-5165-44ab-8543-8e82748ef539_1006x563.png" width="1006" height="563" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/baf86d01-5165-44ab-8543-8e82748ef539_1006x563.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:563,&quot;width&quot;:1006,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:54484,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/161218333?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf86d01-5165-44ab-8543-8e82748ef539_1006x563.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eSDG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf86d01-5165-44ab-8543-8e82748ef539_1006x563.png 424w, https://substackcdn.com/image/fetch/$s_!eSDG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf86d01-5165-44ab-8543-8e82748ef539_1006x563.png 848w, https://substackcdn.com/image/fetch/$s_!eSDG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf86d01-5165-44ab-8543-8e82748ef539_1006x563.png 1272w, https://substackcdn.com/image/fetch/$s_!eSDG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbaf86d01-5165-44ab-8543-8e82748ef539_1006x563.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><strong>Why Care?</strong></p><ul><li><p>Is Netflix pivoting more to TV Shows or sticking with Movies? Over time, user behavior and content costs can shape these strategies.</p></li><li><p>This also reveals the shifting focus from 2005&#8211;2018 (or whichever date range you choose).</p></li></ul><p><strong>Finding:</strong></p>
      <p>
          <a href="https://mlpills.substack.com/p/rw-4-eda-applied-to-netflix-part">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[RW #3 - EDA applied to Netflix (part I)]]></title><description><![CDATA[Exploratory Data Analysis (EDA) is the foundation of any data-driven project. It&#8217;s where you get your first "feel" of the data &#8212; what hides beneath, what patterns emerge, and where problems lurk. Think of EDA as mapping uncharted territory before building anything on top. In this week's pill, we're breaking down the first part of a practical EDA series using Netflix&#8217;s Movies and TV Shows dataset. You'll learn why and how EDA matters, and how every plot you generate now can feed into your Machine Learning pipeline later.]]></description><link>https://mlpills.substack.com/p/rw-3-eda-applied-to-netflix-part</link><guid isPermaLink="false">https://mlpills.substack.com/p/rw-3-eda-applied-to-netflix-part</guid><dc:creator><![CDATA[David Andrés]]></dc:creator><pubDate>Sun, 30 Mar 2025 06:30:57 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1925bf0f-1040-449a-98b6-259d82b46c10_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>&#128138; Pill of the Week</h1><p><strong>Exploratory Data Analysis (EDA) is the foundation of any data-driven project.</strong> It&#8217;s where you get your first "feel" of the data &#8212; what hides beneath, what patterns emerge, and where problems lurk. Think of EDA as mapping uncharted territory before building anything on top.</p><p>In this week's pill, we're breaking down the <strong>first part of a practical EDA</strong> series using <strong>Netflix&#8217;s Movies and TV Shows dataset</strong>. 
You'll learn why and how EDA matters, and how every plot you generate now can feed into your Machine Learning pipeline later.</p><blockquote><p><em>&#9999;&#65039; Article and code by <a href="https://open.substack.com/users/236084597-muhammad-anas?utm_source=mentions">Muhammad Anas</a>.</em></p></blockquote><p></p><h2>What is this EDA Series?</h2><p>Over multiple parts, we&#8217;ll:</p><ul><li><p>Theoretically explain <strong>why EDA matters</strong></p></li><li><p>Use Netflix data to practice &#8212; because who doesn&#8217;t binge Netflix?</p></li><li><p>Show how every insight <strong>translates into ML applications</strong></p></li></ul><p></p><h2>Why Do EDA?</h2><p>Exploratory Data Analysis (EDA) is a crucial first step in any data science or machine learning project. It helps you:</p><p>&#9989; <strong>Understand Data Distributions</strong><br>Gain insight into how your variables are spread&#8212;are they skewed, normally distributed, or full of surprises?</p><p>&#9989; <strong>Detect Missing Values, Outliers, and Inconsistencies</strong><br>Spot issues early&#8212;missing data, anomalous values, or strange patterns that could skew your analysis or mislead your models.</p><p>&#9989; <strong>Discover Relationships Between Variables</strong><br>Identify trends, correlations, and potential causal links. This helps guide both your modeling approach and business interpretation.</p><p>&#9989; <strong>Inform Feature Engineering for ML Models</strong><br>EDA reveals patterns and data quirks that can inspire the creation of powerful new features&#8212;or the removal of redundant ones.</p><p>&#9989; <strong>Refine Business Questions and Assumptions</strong><br>Sometimes the data tells a different story than expected. 
EDA helps align your hypotheses with reality and may uncover new questions worth asking.</p><blockquote><p>&#128269; <strong>Reminder:</strong> <em>Garbage in, garbage out.</em><br>Good EDA saves you from wasting time building models on messy, misleading, or misunderstood data. Think of it as the detective work that sets the stage for everything else.</p></blockquote><p>Do you want more details? Check our <a href="https://mlpills.substack.com/p/issue-68-exploratory-data-analysis">previous MLPills issue</a>:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;48979918-82fd-4461-b81e-b1cbb4327bf3&quot;,&quot;caption&quot;:&quot;&#128138; Pill of the Week&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Issue #67 - Exploratory Data Analysis&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:38707812,&quot;name&quot;:&quot;David Andr&#233;s&quot;,&quot;bio&quot;:&quot;&#128188; Data Scientist &#8226; &#128013; Python enthusiast&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db6423b2-36bc-440c-be7d-b54be5bad1b0_1447x1448.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null},{&quot;id&quot;:132707413,&quot;name&quot;:&quot;Josep Ferrer&quot;,&quot;bio&quot;:&quot;Outstand using data -- Data Science, Design and Tech Tech Writer @KDnuggets @DataCamp &#128073;&#127995;Inquiries in 
rfeers@gmail.com&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd196b5a6-59f2-46dd-99b3-e10ab1bbd27d_604x604.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-07-28T09:18:32.770Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b748a3f-2322-47f8-a36b-680b2fd2affe_1920x1440.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://mlpills.substack.com/p/issue-68-exploratory-data-analysis&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:147069309,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:23,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Machine Learning Pills&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dba4244-97d2-48f0-a2bb-b01c7ea74212_118x118.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p></p><h2>0. 
Dataset - Netflix Movies &amp; TV Shows</h2><p><strong>What&#8217;s inside:</strong></p><ul><li><p>8,807 titles from Netflix (as of 2021)</p></li><li><p>Columns: title, cast, director, country, release year, rating, duration, genre</p></li></ul><p>The dataset is simple but loaded with insights about what Netflix adds, when, and from where.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.kaggle.com/datasets/shivamb/netflix-shows&quot;,&quot;text&quot;:&quot;Get the dataset&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.kaggle.com/datasets/shivamb/netflix-shows"><span>Get the dataset</span></a></p><p></p><h2>1. Release Year Distribution</h2><p><strong>Why care:</strong> Content release years tell us if Netflix focuses more on newer or older content.</p><p><strong>Finding:</strong> There&#8217;s a dramatic rise in content starting around 2010, peaking in 2018 &#8212; the year with the most titles released. Prior to 2000, content is sparse, with only a trickle of titles from the 20th century, including a few surprises from as far back as 1925.</p><p><strong>ML Angle:</strong></p><ul><li><p><strong>Time-decay features:</strong> Recency bias in content could influence recommendation ranking.</p></li><li><p><strong>Content longevity modeling:</strong> Which older titles continue to perform well?</p></li></ul><p>&#128293; <strong>Pro Tip:</strong> If you&#8217;re pitching content to Netflix, target trends from the post-2015 surge &#8212; that&#8217;s when they were clearly scaling aggressively.</p><div class="pullquote"><p>&#128142; Here&#8217;s a snippet of the code. The <strong>full notebook</strong>, including all the code, will be sent exclusively <strong>to paid subscribers on Wednesday</strong>. 
This is a <strong>one-time send</strong>&#8212;only subscribers with an active <strong>paid membership at that time</strong> will receive it via email.&#128142;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://mlpills.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://mlpills.substack.com/subscribe?"><span>Subscribe now</span></a></p></div><pre><code>netflix_df["release_year"].value_counts().plot.barh(
        figsize=(30, 20),
        color="#32a883"
    )</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ohxg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef7a463-300f-45c9-8def-932c02f91b3c_2300x1500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ohxg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef7a463-300f-45c9-8def-932c02f91b3c_2300x1500.png 424w, https://substackcdn.com/image/fetch/$s_!Ohxg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef7a463-300f-45c9-8def-932c02f91b3c_2300x1500.png 848w, https://substackcdn.com/image/fetch/$s_!Ohxg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef7a463-300f-45c9-8def-932c02f91b3c_2300x1500.png 1272w, https://substackcdn.com/image/fetch/$s_!Ohxg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef7a463-300f-45c9-8def-932c02f91b3c_2300x1500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ohxg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef7a463-300f-45c9-8def-932c02f91b3c_2300x1500.png" width="1456" height="950" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ef7a463-300f-45c9-8def-932c02f91b3c_2300x1500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:950,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:243004,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/159603757?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef7a463-300f-45c9-8def-932c02f91b3c_2300x1500.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ohxg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef7a463-300f-45c9-8def-932c02f91b3c_2300x1500.png 424w, https://substackcdn.com/image/fetch/$s_!Ohxg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef7a463-300f-45c9-8def-932c02f91b3c_2300x1500.png 848w, https://substackcdn.com/image/fetch/$s_!Ohxg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef7a463-300f-45c9-8def-932c02f91b3c_2300x1500.png 1272w, https://substackcdn.com/image/fetch/$s_!Ohxg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef7a463-300f-45c9-8def-932c02f91b3c_2300x1500.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>The release year distribution clearly shows Netflix&#8217;s aggressive expansion post-2010, aligning with its pivot from distributor to producer. 2018 stands out as the peak year, followed closely by adjacent years. Older content (pre-2000s) is minimal &#8212; likely classic films added for niche interest. The long tail before 2000 emphasizes that Netflix&#8217;s library is overwhelmingly modern, reflecting a strategy that favors recent, high-engagement content over archival depth.</p><p></p><h2><strong>2. Type of Shows - Movies vs TV Shows</strong></h2><p><strong>Why care:</strong> Determines how users engage &#8212; movies offer quick hits, while shows drive long-term retention.</p><p><strong>Finding:</strong> Movies slightly outnumber TV Shows, but not by much. 
The split is relatively balanced, signaling that both formats are core to Netflix&#8217;s content strategy.</p><p><strong>ML Angle:</strong></p><ul><li><p><strong>Consumption patterns:</strong> A user who watches more TV Shows might prefer serialized storytelling and be less churn-prone.</p></li><li><p><strong>Model feature:</strong> A simple binary feature like <code>is_series</code> can significantly impact predictions for engagement or completion rates.</p></li></ul><p>&#128161; <strong>Fun Fact:</strong> TV series retention rate is Netflix&#8217;s secret growth hack &#8212; that's why you auto-play into the next episode.</p><pre><code># Count the number of titles of each type (Movie vs. TV Show)
sns.countplot(x='type', data=netflix_df)</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H792!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbdba024-a51e-4ec5-af50-84aadab017e9_1634x890.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H792!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbdba024-a51e-4ec5-af50-84aadab017e9_1634x890.png 424w, https://substackcdn.com/image/fetch/$s_!H792!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbdba024-a51e-4ec5-af50-84aadab017e9_1634x890.png 848w, https://substackcdn.com/image/fetch/$s_!H792!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbdba024-a51e-4ec5-af50-84aadab017e9_1634x890.png 1272w, 
https://substackcdn.com/image/fetch/$s_!H792!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbdba024-a51e-4ec5-af50-84aadab017e9_1634x890.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!H792!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbdba024-a51e-4ec5-af50-84aadab017e9_1634x890.png" width="1456" height="793" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbdba024-a51e-4ec5-af50-84aadab017e9_1634x890.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:793,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:81016,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/159603757?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbdba024-a51e-4ec5-af50-84aadab017e9_1634x890.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H792!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbdba024-a51e-4ec5-af50-84aadab017e9_1634x890.png 424w, https://substackcdn.com/image/fetch/$s_!H792!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbdba024-a51e-4ec5-af50-84aadab017e9_1634x890.png 848w, 
https://substackcdn.com/image/fetch/$s_!H792!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbdba024-a51e-4ec5-af50-84aadab017e9_1634x890.png 1272w, https://substackcdn.com/image/fetch/$s_!H792!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbdba024-a51e-4ec5-af50-84aadab017e9_1634x890.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The bar chart confirms that movies still hold the majority, but the gap is narrower than expected &#8212; suggesting that Netflix 
invests heavily in both formats. This balance reflects two strategies: movies provide quick gratification and variety, while TV shows create long-term user hooks. For ML models, distinguishing between these types can inform everything from watch-time predictions to churn modeling.</p><p></p><h2><strong>3. Netflix Ratings Distribution - Is Netflix kid-friendly?</strong></h2><p><strong>Why care:</strong> Ratings define the audience &#8212; mature vs family content can drastically shift engagement, retention, and trust.</p><p><strong>Finding:</strong> <em>TV-MA</em> is by far the most common rating, followed by <em>TV-14</em> and <em>TV-PG</em>. Kid-friendly ratings like <em>TV-Y</em>, <em>TV-G</em>, and <em>G</em> appear far less frequently. A few irregular entries like &#8220;min84&#8221; or &#8220;74 min&#8221; look like duration values that slipped into the rating column, most likely data entry errors.</p><p><strong>ML Angle:</strong></p><ul><li><p><strong>Parental controls &amp; content filters:</strong> Ratings help segment users for safe content recommendations.</p></li><li><p><strong>User profiling:</strong> Ratings can help predict preferences &#8212; e.g., users who watch PG content may churn if served too much TV-MA.</p></li></ul><p><strong>&#128161; Nugget:</strong> Planning a content platform? 
Use rating-based filters to target niche audiences (e.g., family-only viewers or horror buffs).</p><pre><code>sns.countplot(x="rating", data=netflix_df)</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jEOh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb161e7a6-25c4-49be-a074-943487320986_2342x1270.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jEOh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb161e7a6-25c4-49be-a074-943487320986_2342x1270.png 424w, https://substackcdn.com/image/fetch/$s_!jEOh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb161e7a6-25c4-49be-a074-943487320986_2342x1270.png 848w, https://substackcdn.com/image/fetch/$s_!jEOh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb161e7a6-25c4-49be-a074-943487320986_2342x1270.png 1272w, https://substackcdn.com/image/fetch/$s_!jEOh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb161e7a6-25c4-49be-a074-943487320986_2342x1270.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jEOh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb161e7a6-25c4-49be-a074-943487320986_2342x1270.png" width="1456" height="790" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b161e7a6-25c4-49be-a074-943487320986_2342x1270.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:790,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:154656,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/159603757?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb161e7a6-25c4-49be-a074-943487320986_2342x1270.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jEOh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb161e7a6-25c4-49be-a074-943487320986_2342x1270.png 424w, https://substackcdn.com/image/fetch/$s_!jEOh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb161e7a6-25c4-49be-a074-943487320986_2342x1270.png 848w, https://substackcdn.com/image/fetch/$s_!jEOh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb161e7a6-25c4-49be-a074-943487320986_2342x1270.png 1272w, https://substackcdn.com/image/fetch/$s_!jEOh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb161e7a6-25c4-49be-a074-943487320986_2342x1270.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The distribution shows a heavy lean toward mature content &#8212; <em>TV-MA</em> leads with over 3,000 titles, clearly positioning Netflix as an adult-first platform. <em>TV-14</em> and <em>TV-PG</em> add some balance, appealing to teens and broader audiences. However, content for young children is relatively sparse, with minimal titles rated <em>TV-Y</em>, <em>G</em>, or <em>TV-Y7</em>. The presence of non-standard ratings like &#8220;min84&#8221; underscores the importance of cleaning and validating categorical data in real-world datasets.</p><p></p><h2><strong>4. 
Ratings Distribution by Type (Movies/TV Shows)</strong></h2><p><strong>Why care:</strong> Understanding how content type correlates with audience maturity helps tailor recommendations and refine user segments.</p><p><strong>Finding:</strong></p><ul><li><p><strong>Movies</strong> dominate the mature categories (<em>TV-MA</em>, <em>R</em>, <em>TV-14</em>), indicating a strong focus on adult and teen content.</p></li><li><p><strong>TV Shows</strong> are also concentrated in <em>TV-MA</em> and <em>TV-14</em>, but have a slightly better spread in family-friendly ratings like <em>TV-Y</em> and <em>TV-Y7</em>.</p></li></ul><p><strong>ML Angle:</strong></p><ul><li><p><strong>Interaction features:</strong> Combining <code>type</code> and <code>rating</code> can improve model accuracy: a user watching PG movies may behave differently from one watching PG shows.</p></li><li><p><strong>Personalization layers:</strong> ML models can adapt recommendations by preferred content tone <em>and</em> format.</p></li></ul><p><strong>&#128161; Business Insight:</strong> Family-focused platforms could capitalize on Netflix&#8217;s thin children&#8217;s catalog &#8212; especially in the TV show space.</p><div class="pullquote"><p>Here is some of the code. Remember: <strong>next Wednesday</strong> we will share it <strong>in full</strong> with all <strong>paid subscribers</strong>, as a <strong>one-time send.</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://mlpills.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://mlpills.substack.com/subscribe?"><span>Subscribe now</span></a></p></div><pre><code>sns.countplot(
    x="rating",
    data=netflix_df,
    hue="type",
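    # palette is applied per hue level, so color_map is assumed here to map
    # each type ("Movie", "TV Show") to a color rather than each rating
    palette=color_map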
)</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bgiV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94cd85b5-e13a-4ed1-a337-1c7fd7d2f846_2342x1294.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bgiV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94cd85b5-e13a-4ed1-a337-1c7fd7d2f846_2342x1294.png 424w, https://substackcdn.com/image/fetch/$s_!bgiV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94cd85b5-e13a-4ed1-a337-1c7fd7d2f846_2342x1294.png 848w, https://substackcdn.com/image/fetch/$s_!bgiV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94cd85b5-e13a-4ed1-a337-1c7fd7d2f846_2342x1294.png 1272w, https://substackcdn.com/image/fetch/$s_!bgiV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94cd85b5-e13a-4ed1-a337-1c7fd7d2f846_2342x1294.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bgiV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94cd85b5-e13a-4ed1-a337-1c7fd7d2f846_2342x1294.png" width="1456" height="804" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94cd85b5-e13a-4ed1-a337-1c7fd7d2f846_2342x1294.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:804,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:195093,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/159603757?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94cd85b5-e13a-4ed1-a337-1c7fd7d2f846_2342x1294.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bgiV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94cd85b5-e13a-4ed1-a337-1c7fd7d2f846_2342x1294.png 424w, https://substackcdn.com/image/fetch/$s_!bgiV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94cd85b5-e13a-4ed1-a337-1c7fd7d2f846_2342x1294.png 848w, https://substackcdn.com/image/fetch/$s_!bgiV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94cd85b5-e13a-4ed1-a337-1c7fd7d2f846_2342x1294.png 1272w, https://substackcdn.com/image/fetch/$s_!bgiV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94cd85b5-e13a-4ed1-a337-1c7fd7d2f846_2342x1294.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The distribution shows that <strong>Movies skew more adult</strong>, with higher counts in <em>TV-MA</em>, <em>TV-14</em>, and <em>R</em>. TV Shows also lean mature but offer relatively more in child-safe categories (<em>TV-Y</em>, <em>TV-Y7</em>, <em>TV-G</em>), possibly due to serialized educational/kids&#8217; content. These rating-pattern differences suggest distinct audience strategies: movies deliver intensity and range, while TV shows balance bingeable maturity with broader age appeal.</p><p></p><h2><strong>5. Top 5 Netflix Countries - Who&#8217;s producing what you binge?</strong></h2><p><strong>Why care:</strong> Regional content diversity drives global subscriber growth. 
Netflix&#8217;s reach depends on balancing domestic appeal with international flavor.</p><p><strong>Finding:</strong></p><ul><li><p>The <strong>United States</strong> dominates with 66.5% of the titles from the top five countries (the pie chart is normalized over these five, not over the whole catalog).</p></li><li><p><strong>India</strong> follows with 17.7%, a strong showing driven by Bollywood and regional productions.</p></li><li><p><strong>United Kingdom</strong> contributes 7.6%, while <strong>Japan</strong> (4.5%) and <strong>South Korea</strong> (3.6%) round out the top 5.</p></li></ul><p><strong>ML Angle:</strong></p><ul><li><p><strong>Geo-personalization:</strong> Country tags can power location-aware recommendations.</p></li><li><p><strong>Forecasting trends:</strong> Historical production data can help predict future regional content expansion.</p></li></ul><p><strong>&#128161; Business Twist:</strong> There&#8217;s white space for non-US players &#8212; especially platforms targeting Asian or European markets with local-first libraries.</p><pre><code># Count titles per country and keep the five largest
top_countries = netflix_df["country"].value_counts().nlargest(n=5)
# Percentages are shares within these five countries, not the whole catalog
top_countries.plot.pie(
    autopct="%1.1f%%",
    figsize=(15, 10),
    colors=colors
    )</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9I0R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b42927-7ef3-4a26-b660-3820dac3368c_1748x1408.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9I0R!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b42927-7ef3-4a26-b660-3820dac3368c_1748x1408.png 424w, https://substackcdn.com/image/fetch/$s_!9I0R!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b42927-7ef3-4a26-b660-3820dac3368c_1748x1408.png 848w, https://substackcdn.com/image/fetch/$s_!9I0R!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b42927-7ef3-4a26-b660-3820dac3368c_1748x1408.png 1272w, https://substackcdn.com/image/fetch/$s_!9I0R!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b42927-7ef3-4a26-b660-3820dac3368c_1748x1408.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9I0R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b42927-7ef3-4a26-b660-3820dac3368c_1748x1408.png" width="630" height="507.5480769230769" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62b42927-7ef3-4a26-b660-3820dac3368c_1748x1408.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1173,&quot;width&quot;:1456,&quot;resizeWidth&quot;:630,&quot;bytes&quot;:257488,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpills.substack.com/i/159603757?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b42927-7ef3-4a26-b660-3820dac3368c_1748x1408.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9I0R!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b42927-7ef3-4a26-b660-3820dac3368c_1748x1408.png 424w, https://substackcdn.com/image/fetch/$s_!9I0R!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b42927-7ef3-4a26-b660-3820dac3368c_1748x1408.png 848w, https://substackcdn.com/image/fetch/$s_!9I0R!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b42927-7ef3-4a26-b660-3820dac3368c_1748x1408.png 1272w, https://substackcdn.com/image/fetch/$s_!9I0R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62b42927-7ef3-4a26-b660-3820dac3368c_1748x1408.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Among the top five producing countries, Netflix&#8217;s catalog is heavily skewed toward U.S.-produced content, which accounts for about two-thirds of those titles. India, with nearly 18%, is a clear second, consistent with strong growth and an audience hungry for local stories. The UK holds steady at 7.6%, while Japan and South Korea, though culturally influential, contribute relatively few titles. 
This may point to high impact per title for East Asian countries, or to a stronger focus on quality and selective licensing rather than sheer volume, although title counts alone cannot confirm either.</p><p></p><h2><strong>Wrapping Up Part I - Stay for the next binge!</strong></h2><p>So far, we: </p><ul><li><p>&#9989; Time-traveled through Netflix&#8217;s content library </p></li><li><p>&#9989; Checked whether you&#8217;re more likely binging TV shows or movies </p></li><li><p>&#9989; Saw how spicy Netflix really is </p></li><li><p>&#9989; Found Netflix&#8217;s global production hubs</p></li></ul><div class="pullquote"><p><em><strong>&#9888;&#65039;REMEMBER&#9888;&#65039;</strong></em></p><p>&#128142; The <strong>full notebook</strong>, including all the code, will be sent exclusively <strong>to paid subscribers on Wednesday</strong>. This is a <strong>one-time send</strong>: only subscribers with an active <strong>paid membership at that time</strong> will receive it via email.&#128142;</p><p><strong>&#128467;&#65039; Wednesday 2nd of April &#128467;&#65039;</strong></p></div><h2>What&#8217;s Next in Part II?</h2><p>In the next pill, we tackle:</p><ul><li><p>Genre breakdowns</p></li><li><p>Actor/director impact analysis</p></li><li><p>Content addition seasonality trends</p></li></ul><p>All with one goal: <strong>prepping these insights for real ML pipelines</strong>.</p><p><em>Stay tuned, and remember &#8212; better EDA = better models.</em></p><div><hr></div><h1><strong>&#127891;Further Learning*</strong></h1><p>Let us present: &#8220;<a href="https://academy.towardsai.net/courses/beginner-to-advanced-llm-dev?ref=3b122f">From Beginner to Advanced LLM Developer</a>&#8221;. This comprehensive course takes you <strong>from foundational skills to mastering scalable LLM products</strong> through <em>hands-on projects, fine-tuning, RAG, and agent development</em>. 
Whether you're building a standout portfolio, launching a startup idea, or enhancing enterprise solutions, this program equips you to lead the LLM revolution and thrive in a fast-growing, in-demand field.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://academy.towardsai.net/courses/beginner-to-advanced-llm-dev?ref=3b122f" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6iMW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f077642-4adc-4b0f-8afc-2c1ea26f05ab_760x420.png 424w, https://substackcdn.com/image/fetch/$s_!6iMW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f077642-4adc-4b0f-8afc-2c1ea26f05ab_760x420.png 848w, https://substackcdn.com/image/fetch/$s_!6iMW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f077642-4adc-4b0f-8afc-2c1ea26f05ab_760x420.png 1272w, https://substackcdn.com/image/fetch/$s_!6iMW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f077642-4adc-4b0f-8afc-2c1ea26f05ab_760x420.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6iMW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f077642-4adc-4b0f-8afc-2c1ea26f05ab_760x420.png" width="612" height="338.2105263157895" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8f077642-4adc-4b0f-8afc-2c1ea26f05ab_760x420.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:420,&quot;width&quot;:760,&quot;resizeWidth&quot;:612,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:&quot;https://academy.towardsai.net/courses/beginner-to-advanced-llm-dev?ref=3b122f&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!6iMW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f077642-4adc-4b0f-8afc-2c1ea26f05ab_760x420.png 424w, https://substackcdn.com/image/fetch/$s_!6iMW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f077642-4adc-4b0f-8afc-2c1ea26f05ab_760x420.png 848w, https://substackcdn.com/image/fetch/$s_!6iMW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f077642-4adc-4b0f-8afc-2c1ea26f05ab_760x420.png 1272w, https://substackcdn.com/image/fetch/$s_!6iMW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f077642-4adc-4b0f-8afc-2c1ea26f05ab_760x420.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Who Is This Course For?</strong></p><p>This certification is for software developers, machine learning engineers, data scientists or computer science and AI students to rapidly convert to an LLM Developer role and start building</p><p><em>*Sponsored: by purchasing any of their courses you would also be supporting MLPills.</em></p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://mlpills.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Machine Learning Pills is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>
   ]]></content:encoded></item><item><title><![CDATA[RW #2 - AI Agents and Vertical SaaS]]></title><description><![CDATA[Artificial Intelligence is driving a paradigm shift across industries by enhancing efficiency, enabling automation, and solving complex problems. At the forefront of this change are AI agents and Vertical SaaS solutions. While AI agents act autonomously to streamline processes, Vertical SaaS platforms provide tailored solutions to meet the unique needs of specific industries. For data scientists and ML engineers, understanding these trends is essential to leverage their full potential.]]></description><link>https://mlpills.substack.com/p/rw-2-ai-agents-and-vertical-saas</link><guid isPermaLink="false">https://mlpills.substack.com/p/rw-2-ai-agents-and-vertical-saas</guid><dc:creator><![CDATA[David Andrés]]></dc:creator><pubDate>Sun, 15 Dec 2024 19:36:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e7676cbb-996e-47f5-8079-ab1bdcb73dc0_1920x1280.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Today, we&#8217;ll explore AI Agents&#8212;what they are, their potential applications across industries, and their future impact on the real world.</strong></p><p>Before diving in, I want to acknowledge that this issue was created in collaboration with our friends at <em><a href="https://everydaynews.substack.com/">Everyday | AI for All</a>.</em></p><p>Let&#8217;s get started!</p><div><hr></div><h1>&#128138; <strong>Pill of the Week</strong></h1><p>Artificial Intelligence is driving a paradigm shift across industries by enhancing efficiency, enabling automation, and solving complex problems. At the forefront of this change are <strong>AI agents</strong> and <strong>Vertical SaaS</strong> solutions. While AI agents act autonomously to streamline processes, Vertical SaaS platforms provide tailored solutions to meet the unique needs of specific industries. 
For data scientists and ML engineers, understanding these trends is essential to leverage their full potential.</p><p></p><h2><strong>AI Agents</strong></h2><p>AI agents are <strong>autonomous systems designed to perform specific tasks, often simulating human-like decision-making</strong>. Unlike traditional algorithms, AI agents combine perception, reasoning, and action to achieve their objectives. These agents can interact with their environment, process inputs, and make decisions to accomplish tasks with minimal human intervention. Their ability to act independently based on inputs and objectives makes them highly valuable for solving real-world challenges.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8E5l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcab7f02-5317-4206-893a-36f21c30d6ff_964x194.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8E5l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcab7f02-5317-4206-893a-36f21c30d6ff_964x194.png 424w, https://substackcdn.com/image/fetch/$s_!8E5l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcab7f02-5317-4206-893a-36f21c30d6ff_964x194.png 848w, https://substackcdn.com/image/fetch/$s_!8E5l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcab7f02-5317-4206-893a-36f21c30d6ff_964x194.png 1272w, https://substackcdn.com/image/fetch/$s_!8E5l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcab7f02-5317-4206-893a-36f21c30d6ff_964x194.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8E5l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcab7f02-5317-4206-893a-36f21c30d6ff_964x194.png" width="964" height="194" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bcab7f02-5317-4206-893a-36f21c30d6ff_964x194.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:194,&quot;width&quot;:964,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:37866,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8E5l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcab7f02-5317-4206-893a-36f21c30d6ff_964x194.png 424w, https://substackcdn.com/image/fetch/$s_!8E5l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcab7f02-5317-4206-893a-36f21c30d6ff_964x194.png 848w, https://substackcdn.com/image/fetch/$s_!8E5l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcab7f02-5317-4206-893a-36f21c30d6ff_964x194.png 1272w, https://substackcdn.com/image/fetch/$s_!8E5l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcab7f02-5317-4206-893a-36f21c30d6ff_964x194.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>At their core, AI agents operate in a continuous loop of <strong>sense &#8594; think &#8594; 
act</strong>, mirroring how humans process information and respond to their environment:</p><ol><li><p><strong>Sense</strong>: The agent gathers data from its environment using APIs, sensors, or other inputs. This sensing phase is analogous to human perception through our senses, where the agent must filter relevant information from noise and organize inputs meaningfully.</p></li><li><p><strong>Think</strong>: It processes this data, often leveraging machine learning models, to determine the best course of action. This cognitive phase involves pattern recognition, decision-making under uncertainty, and strategic planning &#8211; similar to how humans analyze situations and weigh options.</p></li><li><p><strong>Act</strong>: The agent takes actions based on its reasoning, which might involve updating a database, responding to a user, or triggering processes. These actions then create new situations that the agent must sense and respond to, continuing the cycle.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://everydaynews.substack.com/p/11-rag-vs-agents" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Z8AR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b6c5cbf-f295-4741-b441-ed89a5a12a87_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!Z8AR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b6c5cbf-f295-4741-b441-ed89a5a12a87_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!Z8AR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b6c5cbf-f295-4741-b441-ed89a5a12a87_1920x1080.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Z8AR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b6c5cbf-f295-4741-b441-ed89a5a12a87_1920x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Z8AR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b6c5cbf-f295-4741-b441-ed89a5a12a87_1920x1080.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b6c5cbf-f295-4741-b441-ed89a5a12a87_1920x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://everydaynews.substack.com/p/11-rag-vs-agents&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Z8AR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b6c5cbf-f295-4741-b441-ed89a5a12a87_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!Z8AR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b6c5cbf-f295-4741-b441-ed89a5a12a87_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!Z8AR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b6c5cbf-f295-4741-b441-ed89a5a12a87_1920x1080.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Z8AR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b6c5cbf-f295-4741-b441-ed89a5a12a87_1920x1080.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4>Types of AI Agents</h4><p>When we examine AI agents more closely, we can categorize them based on their cognitive sophistication and operational approach:</p><ol><li><p><strong>Reactive Agents</strong>: These agents respond to inputs in real time without considering historical data. 
Like a reflex action in humans, they follow predefined rules to generate immediate responses. Simple rule-based chatbots that provide answers based on pattern matching exemplify this category. While limited in complexity, reactive agents excel in scenarios requiring quick, consistent responses.</p></li><li><p><strong>Proactive Agents</strong>: These sophisticated agents plan ahead and anticipate future needs or tasks, much like how humans plan their day or prepare for upcoming events. Recommendation systems that suggest products before users search for them, or smart home systems that adjust temperature before residents return home, demonstrate proactive behavior. These agents often employ predictive models and historical analysis to make informed decisions about future actions.</p></li><li><p><strong>Multi-Agent Systems</strong>: These complex networks involve multiple agents collaborating and communicating to solve tasks, similar to how human teams work together. Consider a modern supply chain management system where different AI agents handle inventory forecasting, route optimization, and supplier communication, coordinating their actions to ensure smooth operations. These systems demonstrate emergent intelligence, where the collective capability exceeds that of any individual agent.</p></li></ol><h4>AI Agents in Practice</h4><p>The application of AI agents becomes particularly powerful in Retrieval-Augmented Generation (RAG) systems, where they serve as intelligent orchestrators of information flow. 
</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;f709a2e0-99a6-4709-aee2-adb38f70397b&quot;,&quot;caption&quot;:&quot;&#128138; Pill of the week&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Issue #82 - Introduction to Agentic RAG&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:38707812,&quot;name&quot;:&quot;David Andr&#233;s&quot;,&quot;bio&quot;:&quot;&#128188; Data Scientist &#8226; &#128013; Python enthusiast&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db6423b2-36bc-440c-be7d-b54be5bad1b0_1447x1448.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-11-29T12:10:00.622Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dfdd6ad7-b068-4427-971a-73c60c76ec7d_1920x1280.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://mlpills.substack.com/p/issue-82-introduction-to-agentic-69d&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:152259425,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:3,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Machine Learning Pills&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dba4244-97d2-48f0-a2bb-b01c7ea74212_118x118.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>These agents transform raw data processing into sophisticated information management by dynamically managing multiple tasks:</p><p>A <strong>self-querying agent</strong>, for instance, operates like an expert librarian who understands both the user's 
needs and the library's organization. It performs several functions:</p><ul><li><p>Analyzes user queries to understand implicit and explicit requirements</p></li><li><p>Determines optimal search parameters based on context and past interactions</p></li><li><p>Interfaces with vector databases to retrieve relevant documents</p></li><li><p>Adapts its retrieval strategy based on initial results</p></li></ul><p>When combined with specialized reranking agents, these systems create a dynamic information processing pipeline. The reranking agents act as critical reviewers, evaluating and prioritizing retrieved information based on relevance, reliability, and user context. This multi-stage process helps ensure that the final output not only matches the user's query but also provides meaningful, contextual information that anticipates their broader needs.</p><p>This orchestration lets RAG systems handle complex queries by refining retrieval over several stages instead of relying on a single lookup. The result is a system that adapts both retrieval and generation to the user's needs while remaining efficient.</p><h4><strong>Key Applications</strong></h4><p>Let&#8217;s now see how this concept applies across different industries:</p><ol><li><p><strong>Healthcare: </strong>AI agents analyze medical data (e.g., radiology scans or patient histories) to assist in disease detection and treatment planning. For example, advanced machine learning models can identify anomalies in X-rays or MRIs, enabling early intervention and reducing diagnostic errors.</p></li><li><p><strong>Finance: </strong>AI-powered agents automate fraud detection, financial forecasting, and risk management. By processing massive transaction datasets in real time, they flag suspicious patterns and help optimize trading strategies.</p></li><li><p><strong>Legal Services: </strong>AI agents streamline contract analysis and legal research. 
Using natural language processing (NLP), they sift through extensive case law, enabling faster decisions while reducing human error.</p></li><li><p><strong>Logistics and Supply Chain: </strong>AI agents optimize delivery routes and inventory levels and improve demand forecasts. These agents analyze historical and real-time data to improve efficiency, reduce costs, and minimize delays.</p></li><li><p><strong>Sales and Customer Support: </strong>From automating lead generation to offering personalized recommendations, AI agents help sales teams increase productivity. In customer support, chatbots resolve queries 24/7, improving customer satisfaction and reducing operational costs.</p></li></ol><p></p><h2><strong>Vertical SaaS</strong></h2><p><strong>Vertical SaaS </strong>(Software as a Service) <strong>platforms are cloud-based solutions tailored to industry-specific workflows</strong>. Unlike general SaaS tools, Vertical SaaS focuses on solving niche problems for particular domains, often integrating AI for smarter decision-making and automation.</p><h4><strong>Examples of Vertical SaaS in Industries</strong></h4><ol><li><p><strong>Healthcare</strong><br>SaaS platforms help manage patient records, treatment workflows, and compliance requirements. AI enhances these tools by providing predictive analytics for patient outcomes or automating scheduling processes.</p></li><li><p><strong>Real Estate</strong><br>Vertical SaaS tools analyze property market trends, automate listings, and assist with valuations. By combining AI with domain-specific data, these platforms predict price changes and streamline transactions.</p></li><li><p><strong>Construction</strong><br>Construction-focused SaaS solutions track project timelines, cost estimates, and resource allocation. 
AI models predict delays or budget overruns based on historical data, enabling proactive decision-making.</p></li><li><p><strong>Hospitality</strong><br>In hospitality, Vertical SaaS platforms manage bookings, personalize guest services, and optimize feedback management. AI-powered analytics help predict guest preferences to deliver customized experiences.</p></li></ol><p></p><h2><strong>Implications for data professionals</strong></h2><p>The rise of AI agents and Vertical SaaS platforms <strong>creates both challenges and opportunities for data professionals</strong>. As these technologies become increasingly specialized, data scientists and ML engineers play a critical role in driving innovation.</p><h4><strong>1. Specialization in Domain-Specific AI Solutions</strong></h4><p>AI solutions must align with industry-specific workflows, requiring data professionals to <strong>develop expertise in particular sectors</strong>. For instance, healthcare AI projects demand familiarity with medical imaging datasets, regulatory compliance (e.g., HIPAA), and clinical workflows.</p><h4><strong>2. Collaboration with Domain Experts</strong></h4><p>Effective AI implementation requires <strong>collaboration between data scientists and subject matter experts</strong>. Understanding sector-specific challenges helps ensure the models are interpretable, actionable, and aligned with real-world requirements.</p><h4><strong>3. Ethical Considerations and Model Auditing</strong></h4><p>AI systems can perpetuate biases or make opaque decisions, especially in sensitive areas like finance or healthcare. Data scientists must prioritize fairness, transparency, and accountability when developing and deploying models. <strong>Practices like bias mitigation, explainability techniques, and regular audits are crucial</strong>.</p><h4><strong>4. Rapid Skill Adaptation</strong></h4><p>The rapid advancements in AI tools and SaaS technologies demand <strong>continuous upskilling</strong>. 
Data scientists and ML engineers must stay updated with:</p><ul><li><p>New machine learning frameworks and model architectures.</p></li><li><p>Industry-specific AI tools or APIs (e.g., healthcare AI libraries or fintech automation tools).</p></li><li><p>Trends in AI ethics and regulations.</p></li></ul><h4><strong>5. Expanding Opportunities for Automation</strong></h4><p>The integration of AI agents into SaaS platforms enables significant automation of repetitive tasks. For instance, data labeling, feature engineering, and model deployment pipelines are increasingly automated, allowing data scientists to focus on higher-value strategic work.</p><p></p><h2><strong>Technical and Ethical Considerations</strong></h2><p>While AI agents and Vertical SaaS platforms offer transformative potential, their deployment raises technical and ethical challenges:</p><ul><li><p><strong>Data Privacy and Security</strong><br>Industries like finance and healthcare handle sensitive user data. Data professionals must implement robust security measures (e.g., encryption, anonymization) and ensure compliance with regulations such as GDPR or HIPAA.</p></li><li><p><strong>Algorithmic Bias</strong><br>Biased data can produce discriminatory outcomes. For instance, an AI-powered hiring platform may unintentionally favor one demographic group over others. Ensuring diverse and representative training datasets is vital.</p></li><li><p><strong>Scalability and Reliability</strong><br>Vertical SaaS solutions must scale efficiently to handle growing data volumes without compromising performance. AI systems also need to be robust enough for critical, real-time tasks like fraud detection or healthcare diagnostics.</p></li><li><p><strong>Impact on Jobs</strong><br>Automation through AI agents could displace certain roles, especially those involving repetitive tasks. 
However, it <strong>also creates demand for new roles, such as AI model auditors, AI product managers, and specialized ML engineers</strong>.</p></li></ul><p></p><h2><strong>Future Outlook for AI Agents and Vertical SaaS</strong></h2><p>The synergy between AI agents and Vertical SaaS is set to deepen as industries adopt smarter, data-driven tools. Data scientists and ML engineers will be at the center of this transformation, developing models that improve efficiency, reduce costs, and enable innovative solutions.</p><ul><li><p>In <strong>healthcare</strong>, AI-driven diagnostics will achieve greater accuracy.</p></li><li><p>In <strong>finance</strong>, automated decision-making will further optimize risk management.</p></li><li><p>In <strong>logistics</strong>, real-time predictive analytics will enhance operational efficiency.</p></li></ul><div class="pullquote"><p>For data professionals, the future demands a balance of <strong>technical expertise</strong>, <strong>ethical responsibility</strong>, and <strong>domain knowledge</strong>. Those who embrace specialization, continuous learning, and collaboration with industry experts will lead the charge in shaping AI-driven transformations.</p></div><p></p><h2><strong>Conclusion</strong></h2><p>AI agents and Vertical SaaS platforms are revolutionizing industries through automation, efficiency, and tailored solutions. For data scientists and ML engineers, this shift offers opportunities to build impactful, domain-specific AI systems. By addressing challenges like data privacy, bias, and scalability, professionals can ensure that AI is not only effective but also responsible and trustworthy. 
In an increasingly AI-powered world, the ability to navigate these trends will define success for both businesses and data professionals alike.</p><div><hr></div><p></p><h1><strong>&#10067;Interested in AI-business insights?</strong></h1><p>If you enjoyed this issue and want to stay on top of AI business trends and applications, let us know! <strong>We&#8217;re considering adding an extra issue during the week dedicated to this topic.</strong></p><p>You can also check out the <em><a href="https://everydaynews.substack.com/">&#8220;Everyday&#8221; newsletter</a></em> for more frequent updates.</p><p>Of course, if this isn&#8217;t your cup of tea, no worries&#8212;<em><strong>MLPills</strong></em><strong> will keep delivering the same great content you know and love</strong>!</p><div class="poll-embed" data-attrs="{&quot;id&quot;:246934}" data-component-name="PollToDOM"></div><p>Many thanks for your time &#128522; I hope you liked this slightly different issue!</p><div class="poll-embed" data-attrs="{&quot;id&quot;:246945}" data-component-name="PollToDOM"></div><p></p>]]></content:encoded></item><item><title><![CDATA[RW #1 - Reducing Customer Churn for a Telecom Company]]></title><description><![CDATA[Customer churn, or the rate at which customers leave a service, is a critical issue for telecom companies. High churn rates can significantly impact revenue and profitability. 
By leveraging Data Science, telecom companies can predict which customers are likely to churn and take proactive measures to retain them.&#160;Here's a detailed step-by-step guide to tackling this problem using Data Science.]]></description><link>https://mlpills.substack.com/p/rw-1-reducing-customer-churn-for</link><guid isPermaLink="false">https://mlpills.substack.com/p/rw-1-reducing-customer-churn-for</guid><dc:creator><![CDATA[David Andrés]]></dc:creator><pubDate>Sun, 18 Aug 2024 07:39:23 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/65fd3cf6-b4e7-4f4b-a38f-cf301227a849_1920x1281.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>&#128138; Pill of the Week</h1><p>Today we release a <strong>new section</strong>! Learn how you could implement a technique or a process in a <strong>Real World</strong> scenario. Today we share the <strong>life cycle of a customer churn prediction project</strong>.</p><div class="pullquote"><p>Customer churn, or the rate at which customers leave a service, is a critical issue for telecom companies. High churn rates can significantly impact revenue and profitability. By leveraging Data Science, telecom companies can predict which customers are likely to churn and take proactive measures to retain them. 
</p></div><p>Here's a<strong> detailed step-by-step guide</strong> to tackling this problem using Data Science.</p><p>For the explanation of each step <em>without</em> focusing on this concrete example, you can check the <a href="https://mlpills.substack.com/p/issue-70-the-life-cycle-in-a-data">previous issue</a>:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;5802c42d-b431-47c3-8dd8-bcc674430144&quot;,&quot;caption&quot;:&quot;&#128138; Pill of the Week&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Issue #70 - The life cycle in a Data Science project&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:38707812,&quot;name&quot;:&quot;David Andr&#233;s&quot;,&quot;bio&quot;:&quot;&#128188; Data Scientist &#8226; &#128013; Python enthusiast&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db6423b2-36bc-440c-be7d-b54be5bad1b0_1447x1448.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-08-11T09:24:45.345Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/005845e8-6768-4435-987c-83e61ce67d88_1920x1280.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://mlpills.substack.com/p/issue-70-the-life-cycle-in-a-data&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:147067241,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:6,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Machine Learning 
Pills&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dba4244-97d2-48f0-a2bb-b01c7ea74212_118x118.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://x.com/daansan_ml/status/1822542487527084324" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aRHi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6a1193-106e-4589-a276-8030b93887dc_900x889.jpeg 424w, https://substackcdn.com/image/fetch/$s_!aRHi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6a1193-106e-4589-a276-8030b93887dc_900x889.jpeg 848w, https://substackcdn.com/image/fetch/$s_!aRHi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6a1193-106e-4589-a276-8030b93887dc_900x889.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!aRHi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6a1193-106e-4589-a276-8030b93887dc_900x889.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aRHi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6a1193-106e-4589-a276-8030b93887dc_900x889.jpeg" width="566" height="559.0822222222222" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef6a1193-106e-4589-a276-8030b93887dc_900x889.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:889,&quot;width&quot;:900,&quot;resizeWidth&quot;:566,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://x.com/daansan_ml/status/1822542487527084324&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!aRHi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6a1193-106e-4589-a276-8030b93887dc_900x889.jpeg 424w, https://substackcdn.com/image/fetch/$s_!aRHi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6a1193-106e-4589-a276-8030b93887dc_900x889.jpeg 848w, https://substackcdn.com/image/fetch/$s_!aRHi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6a1193-106e-4589-a276-8030b93887dc_900x889.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!aRHi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef6a1193-106e-4589-a276-8030b93887dc_900x889.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p></p><h1><strong>1. Define the Problem or Question to be Answered</strong></h1><p><strong>Objective:</strong><br>The goal is to reduce customer churn by 20% over the next quarter through predictive modeling.</p><p><strong>Problem Statement:</strong><br><em>"How can we reduce customer churn by 20% in the next quarter by identifying at-risk customers and implementing targeted retention strategies?"</em></p>
      <p>
          <a href="https://mlpills.substack.com/p/rw-1-reducing-customer-churn-for">
              Read more
          </a>
      </p>
   ]]></content:encoded></item></channel></rss>