<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://scrapingnerd.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://scrapingnerd.github.io/" rel="alternate" type="text/html" /><updated>2026-04-12T11:09:19+07:00</updated><id>https://scrapingnerd.github.io/feed.xml</id><title type="html">Scraping Nerd</title><subtitle>Deep dives into advanced web scraping techniques, APIs, and no-code data extraction for TikTok, X.com, and more. Tutorials by Novi Develop.</subtitle><author><name>Novi Develop</name></author><entry><title type="html">Building a Fault-Tolerant YouTube Data Extraction Pipeline for 200k Channels</title><link href="https://scrapingnerd.github.io/building-fault-tolerant-youtube-extraction-pipeline/" rel="alternate" type="text/html" title="Building a Fault-Tolerant YouTube Data Extraction Pipeline for 200k Channels" /><published>2026-04-10T00:00:00+07:00</published><updated>2026-04-10T00:00:00+07:00</updated><id>https://scrapingnerd.github.io/building-fault-tolerant-youtube-extraction-pipeline</id><content type="html" xml:base="https://scrapingnerd.github.io/building-fault-tolerant-youtube-extraction-pipeline/"><![CDATA[<h1 id="building-a-fault-tolerant-youtube-data-extraction-pipeline-for-200k-channels">Building a Fault-Tolerant YouTube Data Extraction Pipeline for 200k Channels</h1>

<p>Extracting data from 200,000 YouTube channels on a daily basis is a monumental engineering challenge. You’ll quickly find that leaning on the standard <strong>YouTube Data API v3</strong> leads to instant quota exhaustion, falling back to tools like <strong>yt-dlp</strong> lands you in the penalty box with CAPTCHAs, and orchestrating <strong>Puppeteer</strong> or <strong>Playwright</strong> grids for headless DOM rendering chews through your server budget without making a dent in the daily workload.</p>

<p>To operate at this scale, heavy browser automation has to go. You must rethink your approach to favor lightweight endpoints, precise delta detection, and robust IP management. Here is a blueprint for doing exactly that.</p>

<hr />

<h2 id="1-ditching-the-dom-a-lean-extraction-strategy">1. Ditching the DOM: A Lean Extraction Strategy</h2>

<p>Traditional scraping methodologies are too heavyweight. They download massive amounts of irrelevant assets and burn CPU cycles rendering layouts. Your objective is raw data, fast.</p>

<h3 id="rss-feeds-as-the-frontline-filter">RSS Feeds as the Frontline Filter</h3>
<p>There’s no need to execute a deep scrape on 200k channels if nothing has changed. Rely on YouTube’s built-in RSS capabilities to verify updates.</p>
<ul>
  <li><strong>The Target:</strong> <code class="language-plaintext highlighter-rouge">https://www.youtube.com/feeds/videos.xml?channel_id=CHANNEL_ID</code></li>
  <li><strong>The Mechanism:</strong> This XML feed is extremely lightweight and fetching it almost never flags bot-protection algorithms. Use it to check for new <code class="language-plaintext highlighter-rouge">entry</code> nodes or updated publishing timestamps. If there is a delta, push the channel ID into your deep-scrape queue. Otherwise, skip it.</li>
</ul>
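<p>As a concrete sketch, delta detection against the Atom feed can be very small. The feed-fetching step is omitted here and the sample feed and stored-state shape are illustrative; in production you would GET the URL above per channel:</p>

```python
# Sketch: delta detection against YouTube's per-channel Atom feed.
# SAMPLE_FEED is a canned example; fetch the real feed over HTTP in production.
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

SAMPLE_FEED = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><id>yt:video:abc123</id><updated>2026-04-10T00:00:00+00:00</updated></entry>
  <entry><id>yt:video:def456</id><updated>2026-04-09T12:00:00+00:00</updated></entry>
</feed>"""

def latest_entry(feed_xml: str):
    """Return (video_id, updated) of the newest entry in the feed."""
    root = ET.fromstring(feed_xml)
    first = root.find(f"{ATOM}entry")
    return first.find(f"{ATOM}id").text, first.find(f"{ATOM}updated").text

def has_delta(feed_xml: str, last_seen: dict) -> bool:
    """True if the channel published or updated since our stored state."""
    vid, updated = latest_entry(feed_xml)
    return last_seen.get("id") != vid or last_seen.get("updated") != updated
```

<p>Only channels where <code class="language-plaintext highlighter-rouge">has_delta</code> returns true enter the deep-scrape queue; everything else costs one tiny XML fetch.</p>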

<h3 id="pivot-to-the-innertube-api">Pivot to the InnerTube API</h3>
<p>Treating <strong>Playwright</strong> as a primary scraping tool is an antipattern for scale. You want to bypass the web UI entirely and interact directly with <strong>InnerTube</strong> (<code class="language-plaintext highlighter-rouge">/youtubei/v1/</code>), YouTube’s internal API.</p>
<ul>
  <li><strong>The Tactics:</strong> Open your network tab, analyze the JSON payloads being dispatched to InnerTube endpoints, and recreate these POST requests natively. Use fast network libraries (like <code class="language-plaintext highlighter-rouge">requests</code> in Python or <code class="language-plaintext highlighter-rouge">net/http</code> in Go) making sure to inject the appropriate <code class="language-plaintext highlighter-rouge">x-youtube-client-name</code> and <code class="language-plaintext highlighter-rouge">x-youtube-client-version</code> parameters.</li>
  <li><strong>Alternative Approaches:</strong> If <code class="language-plaintext highlighter-rouge">yt-dlp</code> is mandatory for highly specific metadata tasks, avoid invoking its web extractor. Submitting the arguments <code class="language-plaintext highlighter-rouge">player_client=android</code> or <code class="language-plaintext highlighter-rouge">player_client=ios</code> leverages mobile endpoints, which possess a separate and often looser set of anti-bot heuristics.</li>
</ul>
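<p>A minimal sketch of recreating an InnerTube call natively. The client-version string, header values, and payload shape below are illustrative placeholders, not guaranteed current values; capture live ones from your own network tab. No request is actually sent here:</p>

```python
# Sketch: building an InnerTube browse request without a browser.
# Values below are placeholders observed on the web client, not canonical.
import json

def build_innertube_request(channel_id: str, client_version: str = "2.20260101.00.00"):
    """Return (headers, body) for a POST to /youtubei/v1/browse."""
    headers = {
        "content-type": "application/json",
        "x-youtube-client-name": "1",  # web client identifier
        "x-youtube-client-version": client_version,
    }
    body = {
        "context": {
            "client": {"clientName": "WEB", "clientVersion": client_version},
        },
        "browseId": channel_id,  # e.g. a UC... channel ID
    }
    return headers, json.dumps(body)
```

<p>Dispatch the resulting pair with any fast HTTP client; the point is that no DOM rendering is involved.</p>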

<hr />

<h2 id="2-ip-reputation-and-traffic-camouflage">2. IP Reputation and Traffic Camouflage</h2>

<p>Even with hyper-optimized API requests, blasting a single domain at this frequency guarantees a swift IP ban unless you manage your footprint.</p>

<h3 id="rotation-mechanics-and-proxy-quality">Rotation Mechanics and Proxy Quality</h3>
<p>Using AWS or DigitalOcean static IPs to scrape YouTube is a dead end.</p>
<ul>
  <li><strong>Residential Networks:</strong> You are required to bounce requests across premium <strong>residential proxies</strong>. These IPs look like ordinary household internet setups, providing excellent camouflage against WAFs.</li>
  <li><strong>Intelligent Cycling:</strong> While rotating IPs is critical, you must maintain session stickiness across paginated extractions. Ensure that a multi-page pull for a single channel is executed from a single IP, then switch proxies before tackling the next channel.</li>
</ul>
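<p>One way to sketch the stickiness logic (proxy URLs are placeholders):</p>

```python
# Sketch: session stickiness -- one proxy per channel's multi-page pull,
# rotating to a fresh proxy for the next channel.
from itertools import cycle

class StickyRotator:
    def __init__(self, proxies):
        self._pool = cycle(proxies)
        self._current = None

    def proxy_for_channel(self):
        """Call once per channel: advance to a fresh proxy."""
        self._current = next(self._pool)
        return self._current

    def proxy_for_page(self):
        """Call for every page of the current channel: reuse the same proxy."""
        return self._current

rotator = StickyRotator(["http://proxy-a:8080", "http://proxy-b:8080"])
```

<p>Paginated pulls call <code class="language-plaintext highlighter-rouge">proxy_for_page()</code> repeatedly, so the WAF sees a consistent origin for one logical session.</p>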

<h3 id="ja3ja4-tls-handshake-mimicry">JA3/JA4 TLS Handshake Mimicry</h3>
<p>Modern enterprise defenses, including Google’s WAF, inspect the cryptographic signatures of incoming connections—specifically the <strong>JA3/JA4 TLS fingerprint</strong>.</p>
<ul>
  <li><strong>The Solution:</strong> A vanilla HTTP client is instantly recognizable as non-human. Utilize specialized packages like <strong>utls</strong> in Go or <strong>curl-impersonate</strong> to spoof the precise TLS handshake patterns of real Chrome or Safari browsers.</li>
</ul>

<hr />

<h2 id="3-data-flow-orchestration">3. Data Flow Orchestration</h2>

<p>Acquiring the raw data is meaningless unless it’s correctly buffered, normalized, and persisted.</p>

<h3 id="distributed-queues">Distributed Queues</h3>
<p>A monolithic script has no place here. Decouple your crawling steps using message brokers.</p>
<ul>
  <li><strong>The Engine:</strong> Implement <strong>Kafka</strong> or <strong>RabbitMQ</strong> to maintain URL state.</li>
  <li><strong>Worker Redundancy:</strong> Disperse the workload across dozens of isolated workers. Should a task fail or a proxy die, the message safely returns to the queue to be attempted by another worker using a fresh IP.</li>
</ul>
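<p>The requeue-on-failure semantics can be illustrated with the stdlib <code class="language-plaintext highlighter-rouge">queue</code> module standing in for Kafka/RabbitMQ acknowledgements (the scrape callable and attempt limit are illustrative):</p>

```python
# Sketch: a worker that returns failed messages to the queue so another
# worker (with a fresh IP) can retry them, dead-lettering after N attempts.
import queue

def run_worker(tasks: queue.Queue, scrape, max_attempts: int = 3):
    done, failed = [], []
    while not tasks.empty():
        channel_id, attempts = tasks.get()
        try:
            scrape(channel_id)
            done.append(channel_id)
        except Exception:
            if attempts + 1 < max_attempts:
                tasks.put((channel_id, attempts + 1))  # nack: retry later
            else:
                failed.append(channel_id)  # dead-letter after repeated failures
    return done, failed
```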

<h3 id="scalable-storage-layers">Scalable Storage Layers</h3>
<p>Writing highly volatile data across 200,000 entities daily will paralyze a poorly designed schema.</p>
<ul>
  <li><strong>Metadata Persistence:</strong> Rely on <strong>PostgreSQL</strong> or <strong>MySQL</strong> exclusively for static baseline metadata (e.g., creation dates, channel UUIDs).</li>
  <li><strong>Metric Ingestion:</strong> Push rapidly changing statistics (like view counts or sub counts) into a dedicated time-series environment—such as <strong>ClickHouse</strong> or <strong>TimescaleDB</strong>. This isolates heavy read/write analytical loads from your core relational databases.</li>
</ul>

<h3 id="graceful-degradation">Graceful Degradation</h3>
<p>When faced with HTTP 429 errors, immediately halting or violently retrying are both bad options. Integrate <strong>exponential backoff</strong> combined with randomized <strong>jitter</strong> intervals. This prevents your worker arrays from accidentally unleashing a self-inflicted DDoS attack upon your own proxy infrastructure when a target endpoint momentarily wobbles.</p>
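<p>A minimal sketch of full-jitter exponential backoff (the <code class="language-plaintext highlighter-rouge">do_request</code> callable is a stand-in for your HTTP layer and returns a status code):</p>

```python
# Sketch: exponential backoff with full jitter for handling HTTP 429s.
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full jitter: uniform between 0 and min(cap, base * 2**attempt)."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def fetch_with_backoff(do_request, max_attempts: int = 5):
    """Retry on 429 with exponentially growing, jittered sleeps."""
    for attempt in range(max_attempts):
        status = do_request()
        if status != 429:
            return status
        time.sleep(backoff_delay(attempt))
    raise RuntimeError("still rate-limited after retries")
```

<p>The jitter de-synchronizes your worker fleet, so a momentary 429 storm does not turn into a thundering herd against your own proxies.</p>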

<hr />

<h2 id="4-operational-guidelines-and-safety">4. Operational Guidelines and Safety</h2>

<p>Engineering at high throughput means you must respect infrastructure realities and legal guidelines.</p>

<ul>
  <li><strong>Data Scope:</strong> Always restrict extraction pipelines to public-facing, generic facts—such as numerical performance metrics. Do not attempt to siphon personal identifiers, private videos, or circumvent authentication blocks.</li>
  <li><strong>Polite Spacing:</strong> Spiking 200,000 requests in a brief window is reckless. Distribute the operational load continuously over the course of 24 hours to reduce anomaly detection severity and behave like a good internet citizen.</li>
</ul>]]></content><author><name>Novi Develop</name></author><summary type="html"><![CDATA[Building a Fault-Tolerant YouTube Data Extraction Pipeline for 200k Channels]]></summary></entry><entry><title type="html">Claude Code Architecture (Part 3): Automated Context Compaction for Data Ops</title><link href="https://scrapingnerd.github.io/automation/automated-context-compaction-data-ops/" rel="alternate" type="text/html" title="Claude Code Architecture (Part 3): Automated Context Compaction for Data Ops" /><published>2026-04-06T12:00:00+07:00</published><updated>2026-04-06T12:00:00+07:00</updated><id>https://scrapingnerd.github.io/automation/automated-context-compaction-data-ops</id><content type="html" xml:base="https://scrapingnerd.github.io/automation/automated-context-compaction-data-ops/"><![CDATA[<p>Welcome to the end of our specialized ScrapingNerd architectural review. After dissecting the agentic <a href="/automation/agentic-search-ripgrep-deep-dive/">Grep integration in Part 2</a>, we arrive at the absolute crux of high-efficiency AI automation limits: Handling Context Bloat.</p>

<p>Executing tools autonomously via scripts is trivial until a tool hands back a 50 MB stack-trace payload. When context sizes explode, operations halt. Claude Code counters this with layered <strong>Automated Context Compaction</strong> thresholds.</p>

<h2 id="the-tool-result-budget-limits">The Tool Result Budget Limits</h2>

<p>Before automated sequences cascade their outputs back into the inference cycle, Claude filters them through an intermediate budget layer (<code class="language-plaintext highlighter-rouge">applyToolResultBudget()</code>). Suppose you parse massive JSON web-request tables entirely in the backend: Claude enforces <code class="language-plaintext highlighter-rouge">maxResultSizeChars</code>, stringently preventing payloads beyond roughly 20,000 characters from resolving into the context.</p>

<p>Rather than sending those 20,000-plus characters, it writes the full payload to disk and returns a short, truncated proxy message that points to its persistent local address.</p>
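<p>An illustrative reimplementation of that budget idea (function and constant names mirror the description above, not Anthropic's real source):</p>

```python
# Sketch: pass small tool results through; spill oversized ones to disk
# and hand the model only a short pointer message.
import os
import tempfile

MAX_RESULT_CHARS = 20_000

def apply_tool_result_budget(result: str, max_chars: int = MAX_RESULT_CHARS) -> str:
    if len(result) <= max_chars:
        return result
    fd, path = tempfile.mkstemp(suffix=".toolresult")
    with os.fdopen(fd, "w") as f:
        f.write(result)
    # The model sees only a pointer to the persisted payload.
    return f"[result truncated: {len(result)} chars written to {path}]"
```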

<h2 id="silent-data-snips">Silent Data Snips</h2>

<p>Data automation thrives on chronological retention, but the output of older tool calls loses inference value as the session runs on.</p>

<p>Claude runs the <code class="language-plaintext highlighter-rouge">HISTORY_SNIP</code> flag inside the execution loop. If a data payload was generated 8 API iterations ago, it programmatically deletes the inner payload while preserving the outer tool brackets (so the LLM remembers that it ran the logic, without perpetually paying the token cost of the huge textual return).</p>
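<p>A hypothetical sketch of the snip pass (the message shape and threshold here are illustrative, not the leaked source):</p>

```python
# Sketch: old tool messages keep their outer bracket (the fact a tool ran)
# while the inner payload is dropped.
SNIP_AFTER = 8  # snip payloads older than this many iterations

def snip_history(messages: list, current_turn: int) -> list:
    snipped = []
    for m in messages:
        if m.get("role") == "tool" and current_turn - m["turn"] > SNIP_AFTER:
            m = {**m, "content": "[payload snipped]"}  # bracket survives, body goes
        snipped.append(m)
    return snipped
```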

<h2 id="threshold-session-syntheses-autocompact">Threshold Session Syntheses (Autocompact)</h2>

<p>The ultimate safety valve executes fully autonomously. When the session climbs to within roughly 13,000 tokens of the maximum context size, an asynchronous fail-safe triggers immediately.</p>

<p><strong>Strategy (Session Memory Compaction):</strong></p>
<ol>
  <li>The backend immediately pauses all active data-traversal tools.</li>
  <li>An isolated forked agent spawns silently and reads the raw transcript dump locally.</li>
  <li>That agent compresses the exact project workflow (<code class="language-plaintext highlighter-rouge"># Workflow</code>, <code class="language-plaintext highlighter-rouge"># Files and Functions</code>, <code class="language-plaintext highlighter-rouge"># Key Results</code>) into structured, synthesized summaries persisted via hidden <code class="language-plaintext highlighter-rouge">.claude/session_memory</code> markers.</li>
  <li>The active session clears all transient, bloated payloads, keeping only the distilled structural map before gracefully resuming its scraping pipeline.</li>
</ol>

<p>This automated cleanup is invisible to the user and guarantees that terminal automations never starve for context, regardless of session length.</p>]]></content><author><name>Scraping Nerd</name></author><category term="Memory State" /><category term="LLM Tuning" /><category term="Data Operations" /><category term="Backend Engineering" /><summary type="html"><![CDATA[The final post in the ScrapingNerd series. We explore how Claude manipulates log sizes specifically targeting data payload context drops via SM-Compact.]]></summary></entry><entry><title type="html">Claude Code Architecture (Part 2): Deep Dive into Agentic Search (GrepTool)</title><link href="https://scrapingnerd.github.io/automation/agentic-search-ripgrep-deep-dive/" rel="alternate" type="text/html" title="Claude Code Architecture (Part 2): Deep Dive into Agentic Search (GrepTool)" /><published>2026-04-06T11:00:00+07:00</published><updated>2026-04-06T11:00:00+07:00</updated><id>https://scrapingnerd.github.io/automation/agentic-search-ripgrep-deep-dive</id><content type="html" xml:base="https://scrapingnerd.github.io/automation/agentic-search-ripgrep-deep-dive/"><![CDATA[<p>Automating large data pipelines depends on traversing massive amounts of content quickly. For coding automation, that means mastering search limits. Following <a href="/automation/automation-infrastructure-claude-code/">Part 1’s</a> overview, we will now analyze Claude’s weapon of choice: The <code class="language-plaintext highlighter-rouge">GrepTool</code>.</p>

<p>You simply cannot feed thousands of files directly into an orchestration prompt. Claude resolves this constraint by embedding ultra-fast local binary search mechanisms natively utilizing <strong>Ripgrep (rg)</strong>.</p>

<h2 id="binary-execution-and-fallback">Binary Execution and Fallback</h2>

<p>Spawning non-native binaries from Node across platforms easily leads to silent failures. To fix this, Claude implements the <code class="language-plaintext highlighter-rouge">getRipgrepConfig()</code> priority sequence:</p>
<ol>
  <li><strong>System Resolver:</strong> Checks for pre-installed user PATH applications.</li>
  <li><strong>Embedded Build:</strong> Falls back to a build statically bundled inside the primary binary.</li>
  <li><strong>Local Vendor Fallback:</strong> Deploys a platform-specific binary vendored during npm setup (e.g. <code class="language-plaintext highlighter-rouge">vendor/ripgrep/{arch}-{platform}/rg</code>).</li>
</ol>

<h2 id="building-the-safe-scrape">Building the Safe Scrape</h2>

<p>When extracting information, garbage in yields garbage out. Claude automatically forces parameters to guarantee stability:</p>
<ul>
  <li>Automatically appends <code class="language-plaintext highlighter-rouge">--max-columns 500</code> to prevent infinite log line string overflows from polluting LLM comprehension.</li>
  <li>Auto-excludes noisy roots like <code class="language-plaintext highlighter-rouge">.git</code>, <code class="language-plaintext highlighter-rouge">.hg</code>, and <code class="language-plaintext highlighter-rouge">.bzr</code>.</li>
  <li>Translates complex string flags into matching semantic filters (for example, safely mapping multi-line context requests onto the corresponding executable flags).</li>
</ul>

<h2 id="resilience--the-eagain-defense">Resilience &amp; the EAGAIN Defense</h2>

<p>Automation inevitably runs headlong into OS resource limits.</p>

<p>Running parallel extraction traces occasionally trips thread-spawn limits in constrained environments such as Docker, crashing pipelines with the obscure <code class="language-plaintext highlighter-rouge">resource temporarily unavailable (EAGAIN)</code> error. Claude intercepts this exact error profile via regex logic and instantly retries the query with <code class="language-plaintext highlighter-rouge">-j 1</code> (forcing single-threaded processing), sacrificing marginal speed to guarantee reliability.</p>
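<p>The retry pattern can be sketched with an injectable <code class="language-plaintext highlighter-rouge">run</code> callable standing in for spawning rg, so the logic is testable without ripgrep installed:</p>

```python
# Sketch of the EAGAIN defense: detect the error text and rerun single-threaded.
# In production, `run` would spawn the rg binary via subprocess.
import re

EAGAIN_RE = re.compile(r"resource temporarily unavailable|EAGAIN", re.IGNORECASE)

def search_with_eagain_fallback(args: list, run):
    """run(args) -> (exit_code, stdout, stderr). Retry with -j 1 on EAGAIN."""
    code, out, err = run(args)
    if code != 0 and EAGAIN_RE.search(err):
        code, out, err = run(["-j", "1", *args])  # force one worker thread
    return code, out
```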

<h2 id="achieving-paged-search">Achieving Paged Search</h2>

<p>To keep the extraction payload from blowing out, <code class="language-plaintext highlighter-rouge">GrepTool</code> implements rigorous <strong>Offset Pagination</strong>. Total raw output is capped using variables like <code class="language-plaintext highlighter-rouge">head_limit</code>. If Ripgrep finds 100,000 matches, it supplies Claude exactly 250, trailed by the warning: <code class="language-plaintext highlighter-rouge">[Showing results with pagination = limit: 250]</code>.</p>

<p>Claude analyzes the partial block, infers data properties iteratively, and can issue successive tool calls against offset indexes (<code class="language-plaintext highlighter-rouge">offset: 250</code>), allowing complete traversal of monolithic codebases in chunks.</p>
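<p>The pagination mechanics described above reduce to a simple slice plus a truncation note; this sketch mirrors that behavior:</p>

```python
# Sketch: offset pagination over a large match set, mirroring the
# head_limit / offset behavior described above.
def paginate(matches: list, offset: int = 0, head_limit: int = 250):
    page = matches[offset : offset + head_limit]
    truncated = offset + head_limit < len(matches)
    note = f"[Showing results with pagination = limit: {head_limit}]" if truncated else ""
    return page, note

matches = [f"file_{i}.py:match" for i in range(1000)]
first_page, note = paginate(matches)            # results 0-249
second_page, _ = paginate(matches, offset=250)  # the next chunk
```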

<p>In <strong>Part 3</strong>, we scale our automation metrics by observing how Claude manages those massive data dumps memory constraints without failing contexts.</p>]]></content><author><name>Scraping Nerd</name></author><category term="Ripgrep" /><category term="Search Engine" /><category term="Scalable Automation" /><category term="Data Parsing" /><summary type="html"><![CDATA[Part 2 of our ScrapingNerd series. Discover how Claude scales search integration via Ripgrep, binary fallbacks, and rigorous offset pagination.]]></summary></entry><entry><title type="html">Claude Code Architecture (Part 1): The Automation Infrastructure</title><link href="https://scrapingnerd.github.io/automation/automation-infrastructure-claude-code/" rel="alternate" type="text/html" title="Claude Code Architecture (Part 1): The Automation Infrastructure" /><published>2026-04-06T10:00:00+07:00</published><updated>2026-04-06T10:00:00+07:00</updated><id>https://scrapingnerd.github.io/automation/automation-infrastructure-claude-code</id><content type="html" xml:base="https://scrapingnerd.github.io/automation/automation-infrastructure-claude-code/"><![CDATA[<p>Welcome to Part 1 of the architecture series dissecting <strong>Claude Code v2.1.88</strong>. From a data engineering and automation perspective, observing how an LLM can traverse directories, execute terminal sequences, and systematically digest massive data structures is a masterclass in resilient infrastructure.</p>

<p>Traditional LLM wrappers suffer heavily from hallucinated outputs that map improperly to CLI flags. Claude tackles this through structured orchestration.</p>

<h2 id="integrating-the-terminal">Integrating the Terminal</h2>

<p>Unlike conventional chat UI frameworks, Claude uses a node backend directly interfaced with your machine’s environment. The primary input loop passes all terminal traces through an abstract system labeled the <code class="language-plaintext highlighter-rouge">QueryEngine</code>.</p>

<p>When you ask the system to “find specific parsing variables in the project,” the LLM isn’t randomly guessing strings. The input generates a structured <code class="language-plaintext highlighter-rouge">tool_use</code> JSON format natively connected to predefined execution models (e.g. <code class="language-plaintext highlighter-rouge">BashTool</code>, <code class="language-plaintext highlighter-rouge">FileReadTool</code>, <code class="language-plaintext highlighter-rouge">GlobTool</code>).</p>

<h3 id="the-zod-validation-layer">The Zod Validation Layer</h3>

<p>Handling raw text integration requires unyielding validation bounds. Before a tool is authorized to scrape file content from your directory, the pipeline passes the raw Anthropic API payload directly into Zod schema interpreters. If Claude suggests traversing <code class="language-plaintext highlighter-rouge">~/.data/*</code> but utilizes the wrong recursive parameter tag, the Zod handler immediately traps the schema violation and feeds an error trace automatically back into the LLM context.</p>

<p>This creates a self-healing automation loop where the LLM realizes its own scraping error and retries syntactically perfectly.</p>
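<p>The self-healing loop reduces to a validate-and-retry pattern. In this sketch a plain Python validator stands in for the Zod schema layer, and the function names are illustrative rather than Anthropic's:</p>

```python
# Sketch: schema violations are fed back as context so the model can
# repair its own tool call on the next attempt.
def validate_tool_call(call: dict) -> list:
    """Return a list of schema violations; empty means valid."""
    errors = []
    if call.get("tool") not in {"BashTool", "FileReadTool", "GlobTool"}:
        errors.append(f"unknown tool: {call.get('tool')}")
    if not isinstance(call.get("input"), dict):
        errors.append("input must be an object")
    return errors

def run_with_self_healing(propose_call, execute, max_retries: int = 3):
    """Feed validation errors back into the model until the call passes."""
    feedback = None
    for _ in range(max_retries):
        call = propose_call(feedback)  # the LLM proposes a tool_use payload
        errors = validate_tool_call(call)
        if not errors:
            return execute(call)
        feedback = "; ".join(errors)  # error trace returns to the LLM context
    raise RuntimeError("tool call never validated")
```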

<h2 id="parallel-tool-execution">Parallel Tool Execution</h2>

<p>Automation thrives on speed. One of the greatest optimization factors within the infrastructure lies in the tool batching technique.</p>

<p>Mutability is identified before any process executes. If Claude decides it needs to scrape logic structures from 10 different <code class="language-plaintext highlighter-rouge">.py</code> files across varying directories using <code class="language-plaintext highlighter-rouge">FileReadTool</code>, the backend explicitly flags that interaction as <code class="language-plaintext highlighter-rouge">readOnly: true</code>. <code class="language-plaintext highlighter-rouge">Promise.all</code> then spawns the commands asynchronously in a concurrent execution layer, so extracting multiple elements is massively parallelized rather than serially bottlenecked.</p>
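<p>The same batching pattern can be sketched in Python, with <code class="language-plaintext highlighter-rouge">asyncio.gather</code> playing the role of <code class="language-plaintext highlighter-rouge">Promise.all</code> (the file names and call shape are illustrative):</p>

```python
# Sketch: batch read-only tool calls concurrently; run mutating ones serially.
import asyncio

async def read_file(path: str) -> str:
    await asyncio.sleep(0)  # stand-in for real async file I/O
    return f"contents of {path}"

async def run_batch(calls: list) -> list:
    if all(c["readOnly"] for c in calls):
        # Safe to parallelize: no call mutates state.
        return list(await asyncio.gather(*(read_file(c["path"]) for c in calls)))
    # Mutating calls run serially to avoid races.
    return [await read_file(c["path"]) for c in calls]

calls = [{"path": f"mod_{i}.py", "readOnly": True} for i in range(10)]
results = asyncio.run(run_batch(calls))
```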

<p>In <strong>Part 2</strong>, we will shift entirely to examine the single most sophisticated automation driver in the architecture: <strong>Agentic Search and the GrepTool layer</strong>.</p>]]></content><author><name>Scraping Nerd</name></author><category term="Automation" /><category term="Tool Execution" /><category term="Parsing" /><category term="Integration" /><summary type="html"><![CDATA[Part 1 of our ScrapingNerd series. We analyze how Claude Code successfully parses LLM output to run seamless automation structures in the terminal.]]></summary></entry><entry><title type="html">Advanced Anti-Bot Evasion: Defeating TLS Fingerprinting and CDP Detection</title><link href="https://scrapingnerd.github.io/advanced-anti-bot-evasion-tls-cdp/" rel="alternate" type="text/html" title="Advanced Anti-Bot Evasion: Defeating TLS Fingerprinting and CDP Detection" /><published>2026-04-03T10:00:00+07:00</published><updated>2026-04-03T10:00:00+07:00</updated><id>https://scrapingnerd.github.io/advanced-anti-bot-evasion-tls-cdp</id><content type="html" xml:base="https://scrapingnerd.github.io/advanced-anti-bot-evasion-tls-cdp/"><![CDATA[<p>Scraping modern web applications protected by enterprise-grade solutions like Cloudflare Turnstile, Akamai, or DataDome requires more than simple request-response cycles. To successfully extract data from high-security targets without encountering <strong>403 Forbidden</strong> errors or <strong>CAPTCHA</strong> loops, you need deep control over your browser fingerprint and network characteristics.</p>

<p>Here is a deep dive into the technical vectors you must master to bypass modern passive and active fingerprinting.</p>

<h2 id="1-defeating-passive-tls-and-http2-fingerprinting">1. Defeating Passive TLS and HTTP/2 Fingerprinting</h2>

<p>Modern WAFs employ passive fingerprinting to identify automated scripts at the network layer before a single line of JavaScript even executes.</p>

<h3 id="tls-fingerprinting-ja3">TLS Fingerprinting (JA3)</h3>
<p>When a client initiates an HTTPS connection, it sends a “Client Hello” packet. Standard libraries like Python’s <code class="language-plaintext highlighter-rouge">requests</code> or <code class="language-plaintext highlighter-rouge">aiohttp</code> generate a very distinct <strong>JA3 Fingerprint</strong> that immediately signals “non-browser” traffic to the WAF.</p>
<ul>
  <li><strong>The Fix:</strong> You must use libraries that support TLS Client Hello GREASE and extension shuffling. Tools like Python’s <strong><code class="language-plaintext highlighter-rouge">curl_cffi</code></strong> or <strong><code class="language-plaintext highlighter-rouge">httpx</code></strong> with specialized transporters can perfectly mimic Chrome’s TLS handshake, drastically reducing early-stage blocking.</li>
</ul>

<h3 id="http2-prioritization">HTTP/2 Prioritization</h3>
<p>Browsers have highly specific, deterministic patterns for requesting resources (CSS, JS, Images) across HTTP/2 multiplexed frames.</p>
<ul>
  <li><strong>Implementation:</strong> Always ensure your scraping client fully supports <strong>HTTP/2</strong>. When using automation drivers like <strong>Playwright</strong> or <strong>Puppeteer</strong>, the browser engine handles this natively, but you must ensure your header order perfectly matches the expected profile of the emulated browser version.</li>
</ul>

<h2 id="2-browser-stealth-and-environment-masking">2. Browser Stealth and Environment Masking</h2>

<p>If a target requires JavaScript execution (e.g., executing a Cloudflare Turnstile challenge), you have no choice but to utilize a headless browser. However, default headless modes leak multiple flags, such as the <code class="language-plaintext highlighter-rouge">navigator.webdriver</code> property.</p>

<ul>
  <li><strong>Playwright with Stealth:</strong> Implement the <strong><code class="language-plaintext highlighter-rouge">stealth</code></strong> plugin via <strong><code class="language-plaintext highlighter-rouge">playwright-extra</code></strong>. This plugin actively patches common leaks, including <code class="language-plaintext highlighter-rouge">chrome.runtime</code>, WebGL vendor strings, and default viewport dimensions.</li>
  <li><strong>Canvas and WebGL Noise:</strong> Advanced anti-bots utilize hardware rendering to “fingerprint” your GPU signature. It’s highly recommended to inject a script that adds slight, consistent “noise” into Canvas rendering. This effectively prevents the WAF from tracking your session across different requests based on your hardware profile.</li>
  <li><strong>The CDP Leak:</strong> Standard automation uses the Chrome DevTools Protocol (CDP), which often leaves detectable traces. For the most hardened targets, consider deploying <strong>undetected-chromedriver</strong> or specialized, modified browser binaries (like Brave or custom Chromium builds) to bypass deep environment introspection.</li>
</ul>

<h2 id="3-behavioral-emulation-avoiding-robotic-heuristics">3. Behavioral Emulation: Avoiding Robotic Heuristics</h2>

<p>Having a perfect fingerprint means nothing if your bot moves like a bot.</p>

<ul>
  <li><strong>Optical-Motor Lag:</strong> Avoid linear cursor movements. Utilize <strong>Bezier curves</strong> and introduce random jitters to simulate human mouse interaction.</li>
  <li><strong>Variable Latency:</strong> Never fire requests or actions at exact intervals. Implement a <strong>Gaussian distribution</strong> for delays (e.g., <code class="language-plaintext highlighter-rouge">time.sleep(random.gauss(5, 1))</code>).</li>
  <li><strong>Input Dynamics:</strong> When interacting with forms, introduce realistic and varying delays between <code class="language-plaintext highlighter-rouge">keydown</code> and <code class="language-plaintext highlighter-rouge">keyup</code> events.</li>
</ul>
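<p>The cursor-movement point above can be sketched as a cubic Bezier path with per-point jitter (coordinates and parameters here are arbitrary; feed the resulting points to your automation driver's mouse API):</p>

```python
# Sketch: a cubic Bezier cursor path with random control points and jitter,
# avoiding the linear, robotic movement that heuristics flag.
import random

def bezier_path(start, end, steps: int = 30, jitter: float = 2.0):
    """Points along a cubic Bezier from start to end with random control points."""
    (x0, y0), (x3, y3) = start, end
    x1, y1 = x0 + (x3 - x0) * 0.3, y0 + random.uniform(-100, 100)
    x2, y2 = x0 + (x3 - x0) * 0.7, y3 + random.uniform(-100, 100)
    points = []
    for i in range(steps + 1):
        t = i / steps
        # Cubic Bezier interpolation.
        x = (1-t)**3*x0 + 3*(1-t)**2*t*x1 + 3*(1-t)*t**2*x2 + t**3*x3
        y = (1-t)**3*y0 + 3*(1-t)**2*t*y1 + 3*(1-t)*t**2*y2 + t**3*y3
        j = 0.0 if i in (0, steps) else jitter  # keep endpoints exact
        points.append((x + random.uniform(-j, j), y + random.uniform(-j, j)))
    return points
```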

<p>Mastering these core components—network-level masking, deep environment stealth, and behavioral emulation—is the baseline requirement for maintaining a successful data extraction pipeline against modern WAFs. If you’d prefer to skip the complexity of building this stack yourself, our <a href="/tools/">pre-built scraping tools</a> already incorporate these evasion techniques for platforms like TikTok and X.com.</p>]]></content><author><name>Novi Develop</name></author><category term="Web Scraping" /><category term="Security" /><category term="Automation" /><category term="tls-fingerprinting" /><category term="ja3" /><category term="puppeteer" /><category term="playwright" /><category term="anti-bot" /><summary type="html"><![CDATA[Scraping modern web applications protected by enterprise-grade solutions like Cloudflare Turnstile, Akamai, or DataDome requires more than simple request-response cycles. To successfully extract data from high-security targets without encountering 403 Forbidden errors or CAPTCHA loops, you need deep control over your browser fingerprint and network characteristics.]]></summary></entry><entry><title type="html">The Claude Code Source Map Leak: Inside the Agentic Harness</title><link href="https://scrapingnerd.github.io/engineering/claude-code-source-map-leak/" rel="alternate" type="text/html" title="The Claude Code Source Map Leak: Inside the Agentic Harness" /><published>2026-04-03T08:00:00+07:00</published><updated>2026-04-03T08:00:00+07:00</updated><id>https://scrapingnerd.github.io/engineering/claude-code-source-map-leak-agentic-harness</id><content type="html" xml:base="https://scrapingnerd.github.io/engineering/claude-code-source-map-leak/"><![CDATA[<p>The “Great Claude Code Leak” of March 2026 stands as a watershed moment in the security of AI development tools. 
By inadvertently exposing the <strong>Agentic Harness</strong> of Anthropic’s flagship CLI tool, the incident provided an unprecedented look at high-level AI orchestration while highlighting critical vulnerabilities in modern JavaScript/TypeScript build pipelines.</p>

<p>Here we’ll analyze the technical root cause of the leak and deconstruct what the exposed architectural code tells us about state-of-the-art AI agent tooling.</p>

<h2 id="1-technical-root-cause-the-source-map-oversight">1. Technical Root Cause: The Source Map Oversight</h2>

<p>The leak was not the result of a sophisticated breach but a <strong>misconfiguration</strong> within the deployment pipeline.</p>

<ul>
  <li><strong>The Mechanism:</strong> Anthropic released version <strong>2.1.88</strong> of the <code class="language-plaintext highlighter-rouge">@anthropic-ai/claude-code</code> npm package containing a massive <strong>59.8 MB Source Map (<code class="language-plaintext highlighter-rouge">.map</code>)</strong> file.</li>
  <li><strong>The Vulnerability:</strong> Source maps are designed to map minified, obfuscated production code back to its original <strong>TypeScript</strong> source for debugging. Due to a known but unpatched bug in the <strong>Bun Runtime</strong> (Issue #28001), the build process ignored the <code class="language-plaintext highlighter-rouge">minify: true</code> and <code class="language-plaintext highlighter-rouge">sourcemap: none</code> flags, bundling the full source mapping into the public artifact.</li>
  <li><strong>Data Exposure:</strong> Beyond the logic, the source map contained hardcoded pointers to an internal <strong>R2 Bucket</strong> (Cloudflare). This bucket was temporarily misconfigured with public read permissions, allowing users to reconstruct the entire repository.</li>
</ul>

<h2 id="2-architectural-revelations-deconstructing-the-agentic-harness">2. Architectural Revelations: Deconstructing the “Agentic Harness”</h2>

<p>The leak exposed over <strong>512,000 lines of TypeScript</strong>, revealing the sophisticated engineering required to make an LLM act as an autonomous developer.</p>

<h3 id="strict-write-discipline--memory-management">Strict Write Discipline &amp; Memory Management</h3>
<p>Claude Code utilizes a <strong>Self-Healing Memory</strong> architecture. Unlike simpler wrappers, it implements a <strong>Strict Write Discipline</strong> where the agent’s internal state (context) is only updated after a filesystem operation is verified via checksums. This effectively prevents “context drift” during long-running refactoring sessions.</p>

<h3 id="the-autodream-background-process">The “autoDream” Background Process</h3>
<p>The source code revealed a background optimization routine called <code class="language-plaintext highlighter-rouge">autoDream</code>. This process triggers during idle periods to consolidate session logs, prune redundant tokens, and “compress” the history into a high-density vector representation. The result is a significantly reduced <strong>TTFT (Time To First Token)</strong> for subsequent queries.</p>

<h3 id="tooling-orchestration">Tooling Orchestration</h3>
<p>The codebase manages a fleet of over <strong>40 specialized tools</strong>, demonstrating advanced delegation, including:</p>
<ul>
  <li><strong>Playwright Integration:</strong> For headless browser testing and UI verification.</li>
  <li><strong>PostgreSQL Adapters:</strong> For direct schema introspection and data migration.</li>
  <li><strong>LSP (Language Server Protocol) Clients:</strong> Enabling Claude to “understand” symbol definitions and references across a workspace without reading every file into the context window.</li>
</ul>
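<p>The LSP point deserves emphasis: the protocol is an open JSON-RPC standard, so resolving a symbol costs one small message rather than a context window full of source files. A definition lookup, per the LSP specification, looks like this (sketched in Python; the file URI and position are arbitrary examples):</p>

```python
# A textDocument/definition request as defined by the Language Server Protocol.
# This is why LSP is cheap for an agent: resolving a symbol is one small
# JSON-RPC message, not a full read of every candidate file.
import json

def definition_request(request_id: int, file_uri: str, line: int, character: int) -> str:
    """Serialize an LSP textDocument/definition request (positions are zero-based)."""
    msg = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "textDocument/definition",
        "params": {
            "textDocument": {"uri": file_uri},
            "position": {"line": line, "character": character},
        },
    }
    return json.dumps(msg)

# Usage
payload = definition_request(1, "file:///src/app.py", 41, 10)
```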

<p>This look into Anthropic’s secret sauce changes how developers understand LLM tooling—proving that building autonomous AI requires incredibly deep, classical engineering infrastructure alongside the prompts. The same principle applies to data extraction: reliable <a href="/tools/">scraping tools</a> rely on robust orchestration, proxy management, and anti-detection layers that go far beyond simple HTTP requests.</p>]]></content><author><name>Novi Develop</name></author><category term="Systems Engineering" /><category term="Security" /><category term="AI" /><category term="openai" /><category term="claude" /><category term="cybersecurity" /><category term="agentic-harness" /><summary type="html"><![CDATA[The “Great Claude Code Leak” of March 2026 stands as a watershed moment in the security of AI development tools. By inadvertently exposing the Agentic Harness of Anthropic’s flagship CLI tool, the incident provided an unprecedented look at high-level AI orchestration while highlighting critical vulnerabilities in modern JavaScript/TypeScript build pipelines.]]></summary></entry><entry><title type="html">Part 4: OAuth2 authentication and YouTube Data API uploads</title><link href="https://scrapingnerd.github.io/tiktok/tiktok-youtube-part4-scheduling-publishing/" rel="alternate" type="text/html" title="Part 4: OAuth2 authentication and YouTube Data API uploads" /><published>2026-03-27T08:00:00+07:00</published><updated>2026-03-27T08:00:00+07:00</updated><id>https://scrapingnerd.github.io/tiktok/tiktok-youtube-part4-scheduling-publishing</id><content type="html" xml:base="https://scrapingnerd.github.io/tiktok/tiktok-youtube-part4-scheduling-publishing/"><![CDATA[<p>This is the final component of our distributed pipeline. 
Having successfully managed the ETL process (Extract from TikTok, Transform via LLM/FFmpeg, Load to disk in <a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3</a>), we now publish the normalized <code class="language-plaintext highlighter-rouge">.mp4</code> artifacts to their destination platform through the YouTube Data API v3.</p>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><a href="/tiktok/tiktok-youtube-part1-searching-api/">Part 1: Interfacing with the Apify Python SDK</a></li>
    <li><a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2: Asynchronous video ingestion and connection pooling</a></li>
    <li><a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3: LLM script synthesis and FFmpeg concatenation</a></li>
    <li><strong>Part 4: OAuth2 authentication and YouTube Data API uploads</strong> ← You are here</li>
  </ul>
</blockquote>

<h2 id="architecture-and-dependency-context">Architecture and dependency context</h2>

<p>We require Google’s official client libraries: <code class="language-plaintext highlighter-rouge">google-auth-oauthlib</code> for the OAuth authorization flow and <code class="language-plaintext highlighter-rouge">google-api-python-client</code> for the discovery-based endpoint definitions.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>google-auth google-auth-oauthlib google-api-python-client
</code></pre></div></div>

<h2 id="managing-oauth2-credentials-lifecycle">Managing OAuth2 credentials lifecycle</h2>

<p>The primary challenge of uploading via the YouTube API is maintaining the <code class="language-plaintext highlighter-rouge">access_token</code> lifecycle without repeated human intervention. We request permissions once against the local loopback interface and store a serialized binary object (<code class="language-plaintext highlighter-rouge">.pickle</code>) containing the long-lived <code class="language-plaintext highlighter-rouge">refresh_token</code>, which the client can exchange for fresh access tokens indefinitely.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># google_auth_handler.py
</span><span class="kn">import</span> <span class="n">pickle</span>
<span class="kn">from</span> <span class="n">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="kn">from</span> <span class="n">google.auth.transport.requests</span> <span class="kn">import</span> <span class="n">Request</span>
<span class="kn">from</span> <span class="n">google_auth_oauthlib.flow</span> <span class="kn">import</span> <span class="n">InstalledAppFlow</span>
<span class="kn">from</span> <span class="n">googleapiclient.discovery</span> <span class="kn">import</span> <span class="n">build</span>

<span class="n">SCOPES</span> <span class="o">=</span> <span class="p">[</span><span class="sh">"</span><span class="s">https://www.googleapis.com/auth/youtube.upload</span><span class="sh">"</span><span class="p">]</span>
<span class="n">TOKEN_CACHE</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">credentials/youtube_token.pickle</span><span class="sh">"</span><span class="p">)</span>
<span class="n">CLIENT_SECRET</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">credentials/client_secret.json</span><span class="sh">"</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">bootstrap_oauth_service</span><span class="p">():</span>
    <span class="sh">"""</span><span class="s">Initializes the YouTube API interface, rotating tokens automatically.</span><span class="sh">"""</span>
    <span class="n">creds</span> <span class="o">=</span> <span class="bp">None</span>
    
    <span class="c1"># Load previously serialized tokens to avoid continuous user authorization
</span>    <span class="k">if</span> <span class="n">TOKEN_CACHE</span><span class="p">.</span><span class="nf">exists</span><span class="p">():</span>
        <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">TOKEN_CACHE</span><span class="p">,</span> <span class="sh">"</span><span class="s">rb</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">token</span><span class="p">:</span>
            <span class="n">creds</span> <span class="o">=</span> <span class="n">pickle</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">token</span><span class="p">)</span>

    <span class="k">if</span> <span class="ow">not</span> <span class="n">creds</span> <span class="ow">or</span> <span class="ow">not</span> <span class="n">creds</span><span class="p">.</span><span class="n">valid</span><span class="p">:</span>
        <span class="c1"># Transparently solicit a new access_token if the short-lived token expired
</span>        <span class="k">if</span> <span class="n">creds</span> <span class="ow">and</span> <span class="n">creds</span><span class="p">.</span><span class="n">expired</span> <span class="ow">and</span> <span class="n">creds</span><span class="p">.</span><span class="n">refresh_token</span><span class="p">:</span>
            <span class="n">creds</span><span class="p">.</span><span class="nf">refresh</span><span class="p">(</span><span class="nc">Request</span><span class="p">())</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="c1"># Cold-start loopback auth
</span>            <span class="n">flow</span> <span class="o">=</span> <span class="n">InstalledAppFlow</span><span class="p">.</span><span class="nf">from_client_secrets_file</span><span class="p">(</span>
                <span class="nf">str</span><span class="p">(</span><span class="n">CLIENT_SECRET</span><span class="p">),</span> <span class="n">SCOPES</span>
            <span class="p">)</span>
            <span class="n">creds</span> <span class="o">=</span> <span class="n">flow</span><span class="p">.</span><span class="nf">run_local_server</span><span class="p">(</span><span class="n">port</span><span class="o">=</span><span class="mi">8090</span><span class="p">)</span>

        <span class="c1"># Serialize
</span>        <span class="n">TOKEN_CACHE</span><span class="p">.</span><span class="n">parent</span><span class="p">.</span><span class="nf">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">TOKEN_CACHE</span><span class="p">,</span> <span class="sh">"</span><span class="s">wb</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">token</span><span class="p">:</span>
            <span class="n">pickle</span><span class="p">.</span><span class="nf">dump</span><span class="p">(</span><span class="n">creds</span><span class="p">,</span> <span class="n">token</span><span class="p">)</span>

    <span class="k">return</span> <span class="nf">build</span><span class="p">(</span><span class="sh">"</span><span class="s">youtube</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">v3</span><span class="sh">"</span><span class="p">,</span> <span class="n">credentials</span><span class="o">=</span><span class="n">creds</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="resumable-media-chunks-and-standardizing-the-upload-api">Resumable media chunks and standardizing the upload API</h2>

<p>Video uploads fail unpredictably, and a single monolithic POST offers no recovery point: any mid-stream network fault forces a full restart. We therefore use <code class="language-plaintext highlighter-rouge">MediaFileUpload</code> to configure chunked, resumable transfers.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># sync_youtube.py
</span><span class="kn">from</span> <span class="n">googleapiclient.http</span> <span class="kn">import</span> <span class="n">MediaFileUpload</span>
<span class="kn">from</span> <span class="n">google_auth_handler</span> <span class="kn">import</span> <span class="n">bootstrap_oauth_service</span>

<span class="k">def</span> <span class="nf">transmit_artifact</span><span class="p">(</span><span class="n">youtube_service</span><span class="p">,</span> <span class="n">video_filepath</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">metadata</span><span class="p">:</span> <span class="nb">dict</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Executes a resumable file transmission of an `.mp4` artifact.</span><span class="sh">"""</span>
    
    <span class="c1"># Define request payload using YouTube's Data Schema
</span>    <span class="n">request_body</span> <span class="o">=</span> <span class="p">{</span>
        <span class="sh">"</span><span class="s">snippet</span><span class="sh">"</span><span class="p">:</span> <span class="p">{</span>
            <span class="sh">"</span><span class="s">title</span><span class="sh">"</span><span class="p">:</span> <span class="n">metadata</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">title</span><span class="sh">"</span><span class="p">),</span>
            <span class="sh">"</span><span class="s">description</span><span class="sh">"</span><span class="p">:</span> <span class="n">metadata</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">description</span><span class="sh">"</span><span class="p">),</span>
            <span class="sh">"</span><span class="s">tags</span><span class="sh">"</span><span class="p">:</span> <span class="n">metadata</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">tags</span><span class="sh">"</span><span class="p">,</span> <span class="p">[]),</span>
            <span class="sh">"</span><span class="s">categoryId</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">25</span><span class="sh">"</span> <span class="c1"># Explicitly cast to 'News &amp; Politics' category
</span>        <span class="p">},</span>
        <span class="sh">"</span><span class="s">status</span><span class="sh">"</span><span class="p">:</span> <span class="p">{</span>
            <span class="sh">"</span><span class="s">privacyStatus</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">private</span><span class="sh">"</span><span class="p">,</span> <span class="c1"># Stage immediately, authorize later
</span>            <span class="sh">"</span><span class="s">selfDeclaredMadeForKids</span><span class="sh">"</span><span class="p">:</span> <span class="bp">False</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="c1"># Transmit via 10 MB contiguous chunks
</span>    <span class="n">chunk_stream</span> <span class="o">=</span> <span class="nc">MediaFileUpload</span><span class="p">(</span>
        <span class="n">video_filepath</span><span class="p">,</span>
        <span class="n">mimetype</span><span class="o">=</span><span class="sh">"</span><span class="s">video/mp4</span><span class="sh">"</span><span class="p">,</span>
        <span class="n">resumable</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="n">chunksize</span><span class="o">=</span><span class="mi">10_485_760</span> 
    <span class="p">)</span>

    <span class="n">request</span> <span class="o">=</span> <span class="n">youtube_service</span><span class="p">.</span><span class="nf">videos</span><span class="p">().</span><span class="nf">insert</span><span class="p">(</span>
        <span class="n">part</span><span class="o">=</span><span class="sh">"</span><span class="s">snippet,status</span><span class="sh">"</span><span class="p">,</span>
        <span class="n">body</span><span class="o">=</span><span class="n">request_body</span><span class="p">,</span>
        <span class="n">media_body</span><span class="o">=</span><span class="n">chunk_stream</span>
    <span class="p">)</span>

    <span class="c1"># Poll server transmission rate
</span>    <span class="n">response</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="k">while</span> <span class="n">response</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">status</span><span class="p">,</span> <span class="n">response</span> <span class="o">=</span> <span class="n">request</span><span class="p">.</span><span class="nf">next_chunk</span><span class="p">()</span>
        <span class="k">if</span> <span class="n">status</span><span class="p">:</span>
            <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Transmission progress: </span><span class="si">{</span><span class="nf">int</span><span class="p">(</span><span class="n">status</span><span class="p">.</span><span class="nf">progress</span><span class="p">()</span> <span class="o">*</span> <span class="mi">100</span><span class="p">)</span><span class="si">}</span><span class="s">%</span><span class="sh">"</span><span class="p">)</span>

    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Artifact persisted: https://youtu.be/</span><span class="si">{</span><span class="n">response</span><span class="p">[</span><span class="sh">'</span><span class="s">id</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>
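<p>Note that the polling loop above retries nothing on its own: one transient network error aborts the entire transfer. A common hardening step is to wrap each <code class="language-plaintext highlighter-rouge">next_chunk()</code> call in exponential backoff. The helper below is a hedged sketch (the retry budget and delays are arbitrary choices, not API requirements); in production you would likely also retry on <code class="language-plaintext highlighter-rouge">googleapiclient.errors.HttpError</code> with a 5xx status:</p>

```python
# Generic exponential-backoff wrapper for flaky calls such as request.next_chunk().
# Retry budget and delays are illustrative, not mandated by the YouTube API.
import time

def with_backoff(fn, *, retries: int = 5, base_delay: float = 1.0,
                 retryable: tuple = (ConnectionError,)):
    """Call fn(); on a retryable exception, sleep base_delay * 2**attempt, then retry."""
    for attempt in range(retries):
        try:
            return fn()
        except retryable:
            if attempt == retries - 1:
                raise  # budget exhausted; surface the failure to the caller
            time.sleep(base_delay * (2 ** attempt))

# Inside the upload loop, the bare call becomes:
#     status, response = with_backoff(request.next_chunk)
```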

<h2 id="distributed-orchestration-vs-generic-cron-logic">Distributed orchestration vs generic cron logic</h2>

<p>While a monolithic <code class="language-plaintext highlighter-rouge">bash</code> script or basic <code class="language-plaintext highlighter-rouge">cron</code> daemon can wrap this four-step pipeline, serious deployments rely on containerized execution.</p>

<p><strong>Apify Webhooks</strong>: Using Apify as a task queue, you can bind a web endpoint to the <a href="https://apify.com/novi/advanced-search-tiktok-api?fpr=7hce1m">Advanced TikTok Search API Actor</a> directly. Post-execution, the webhook triggers a downstream CI/CD worker (e.g., GitHub Actions or AWS Lambda) which executes this normalized codebase sequentially.</p>

<p>By pushing the extraction layer onto a managed platform and keeping the transform and load steps within a localized microservice container, you fully isolate volatile web scraping states from stable application logic.</p>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><a href="/tiktok/tiktok-youtube-part1-searching-api/">Part 1: Interfacing with the Apify Python SDK</a></li>
    <li><a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2: Asynchronous video ingestion and connection pooling</a></li>
    <li><a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3: LLM script synthesis and FFmpeg concatenation</a></li>
    <li><strong>Part 4: OAuth2 authentication and YouTube Data API uploads</strong> ← You are here</li>
  </ul>
</blockquote>]]></content><author><name>Novi Develop</name></author><category term="Systems Engineering" /><category term="Video Automation" /><category term="Python" /><category term="youtube-api" /><category term="python" /><category term="oauth2" /><category term="automation" /><category term="cron" /><summary type="html"><![CDATA[This is the final component of our distributed pipeline. Having successfully managed the ETL process (Extract from TikTok, Transform via LLM/FFmpeg, Load to disk in Part 3), we now map the normalized .mp4 chunks into a destination platform: the YouTube Data API v3.]]></summary></entry><entry><title type="html">Part 3: LLM script synthesis and FFmpeg concatenation</title><link href="https://scrapingnerd.github.io/tiktok/tiktok-youtube-part3-ai-narration/" rel="alternate" type="text/html" title="Part 3: LLM script synthesis and FFmpeg concatenation" /><published>2026-03-26T08:00:00+07:00</published><updated>2026-03-26T08:00:00+07:00</updated><id>https://scrapingnerd.github.io/tiktok/tiktok-youtube-part3-ai-narration</id><content type="html" xml:base="https://scrapingnerd.github.io/tiktok/tiktok-youtube-part3-ai-narration/"><![CDATA[<p>This is Part 3 of our pipeline series. (See the <a href="/tiktok/youtube-news-channel-tiktok-data/">Architecture Overview</a> for context.) Having isolated independent <code class="language-plaintext highlighter-rouge">.mp4</code> chunks in our local filesystem from <a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2</a>, we must now parse metadata via OpenAI and concatenate the sequence using FFmpeg.</p>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><a href="/tiktok/tiktok-youtube-part1-searching-api/">Part 1: Interfacing with the Apify Python SDK</a></li>
    <li><a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2: Asynchronous video ingestion and connection pooling</a></li>
    <li><strong>Part 3: LLM script synthesis and FFmpeg concatenation</strong> ← You are here</li>
    <li><a href="/tiktok/tiktok-youtube-part4-scheduling-publishing/">Part 4: OAuth2 authentication and YouTube Data API uploads</a></li>
  </ul>
</blockquote>

<h2 id="engineering-dependencies">Engineering dependencies</h2>

<p>We interface with OpenAI’s asynchronous API wrapper and the command-line FFmpeg application. FFmpeg operations are delegated via Python’s <code class="language-plaintext highlighter-rouge">subprocess</code> for raw control over transcode flags.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>openai
</code></pre></div></div>
<p><em>Ensure the FFmpeg binary is installed on your OS environment (<code class="language-plaintext highlighter-rouge">sudo apt install ffmpeg</code> or <code class="language-plaintext highlighter-rouge">brew install ffmpeg</code>).</em></p>

<h2 id="prompt-engineering-the-llm-component">Prompt engineering the LLM component</h2>

<p>We construct a narrative bridging the disparate <code class="language-plaintext highlighter-rouge">.mp4</code> clips by pushing the accompanying <code class="language-plaintext highlighter-rouge">desc</code> metadata to an LLM context window.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># synthesize_script.py
</span><span class="kn">import</span> <span class="n">json</span>
<span class="kn">import</span> <span class="n">os</span>
<span class="kn">from</span> <span class="n">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>

<span class="n">SYSTEM_PROMPT</span> <span class="o">=</span> <span class="sh">"""</span><span class="s">
You are a factual, unbiased news synthesis engine. Given a JSON array of localized multimedia captions originating from social media on a specific geographic event, synthesize a tight 60-second broadcast script. Return raw text without markdown headers, HTML, commentary, or social media references.
</span><span class="sh">"""</span>

<span class="k">def</span> <span class="nf">synthesize_broadcast</span><span class="p">(</span><span class="n">api_key</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">topic</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">context</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Invokes OpenAI ChatGPT to generate continuous editorial.</span><span class="sh">"""</span>
    <span class="n">client</span> <span class="o">=</span> <span class="nc">OpenAI</span><span class="p">(</span><span class="n">api_key</span><span class="o">=</span><span class="n">api_key</span><span class="p">)</span>
    
    <span class="c1"># Prune elements beyond context threshold
</span>    <span class="n">pruned_context</span> <span class="o">=</span> <span class="n">context</span><span class="p">[:</span><span class="mi">20</span><span class="p">]</span> 
    
    <span class="n">payload</span> <span class="o">=</span> <span class="sa">f</span><span class="sh">"</span><span class="s">Topic Context: </span><span class="si">{</span><span class="n">topic</span><span class="si">}</span><span class="se">\n\n</span><span class="s">Ingestion Metadata:</span><span class="se">\n</span><span class="sh">"</span> 
    <span class="n">payload</span> <span class="o">+=</span> <span class="sh">"</span><span class="se">\n</span><span class="sh">"</span><span class="p">.</span><span class="nf">join</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">- </span><span class="si">{</span><span class="n">c</span><span class="si">}</span><span class="sh">"</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">pruned_context</span><span class="p">)</span>
    
    <span class="n">resp</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">chat</span><span class="p">.</span><span class="n">completions</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="sh">"</span><span class="s">gpt-4o-mini</span><span class="sh">"</span><span class="p">,</span>
        <span class="n">temperature</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span> <span class="c1"># Low variance
</span>        <span class="n">messages</span><span class="o">=</span><span class="p">[</span>
            <span class="p">{</span><span class="sh">"</span><span class="s">role</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">system</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="n">SYSTEM_PROMPT</span><span class="p">},</span>
            <span class="p">{</span><span class="sh">"</span><span class="s">role</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">user</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="n">payload</span><span class="p">}</span>
        <span class="p">]</span>
    <span class="p">)</span>
    <span class="k">return</span> <span class="n">resp</span><span class="p">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">message</span><span class="p">.</span><span class="n">content</span>

<span class="k">def</span> <span class="nf">synthesize_audio</span><span class="p">(</span><span class="n">api_key</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">script</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">out_path</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">Invokes the neural TTS engine.</span><span class="sh">"""</span>
    <span class="n">client</span> <span class="o">=</span> <span class="nc">OpenAI</span><span class="p">(</span><span class="n">api_key</span><span class="o">=</span><span class="n">api_key</span><span class="p">)</span>
    <span class="k">with</span> <span class="n">client</span><span class="p">.</span><span class="n">audio</span><span class="p">.</span><span class="n">speech</span><span class="p">.</span><span class="n">with_streaming_response</span><span class="p">.</span><span class="nf">create</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="sh">"</span><span class="s">tts-1</span><span class="sh">"</span><span class="p">,</span>
        <span class="n">voice</span><span class="o">=</span><span class="sh">"</span><span class="s">onyx</span><span class="sh">"</span><span class="p">,</span>
        <span class="nb">input</span><span class="o">=</span><span class="n">script</span><span class="p">,</span>
    <span class="p">)</span> <span class="k">as</span> <span class="n">response</span><span class="p">:</span>
        <span class="n">response</span><span class="p">.</span><span class="nf">stream_to_file</span><span class="p">(</span><span class="n">out_path</span><span class="p">)</span>
</code></pre></div></div>

<p>By strictly managing the <code class="language-plaintext highlighter-rouge">temperature</code> parameter and heavily biasing the system prompt against colloquialisms, we extract objective signals from unstructured social streams.</p>

<h2 id="demuxing-and-concatenating-via-ffmpeg">Demuxing and concatenating via FFmpeg</h2>

<p>With <code class="language-plaintext highlighter-rouge">narration.mp3</code> isolated, we build an FFmpeg concat list (an intermediate <code class="language-plaintext highlighter-rouge">.txt</code> file), normalize every clip to a single H.264 profile (<code class="language-plaintext highlighter-rouge">libx264</code>), and then mux the narration audio in directly, avoiding unnecessary intermediate transcodes.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># build_artifacts.py
</span><span class="kn">import</span> <span class="n">subprocess</span>
<span class="kn">from</span> <span class="n">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="k">def</span> <span class="nf">concat_and_mux</span><span class="p">(</span><span class="n">video_paths</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">Path</span><span class="p">],</span> <span class="n">audio_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">,</span> <span class="n">out_path</span><span class="p">:</span> <span class="n">Path</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">
    Standardizes aspect ratios, drops source audio channels, concatenates video tracks, 
    and muxes the secondary audio track.
    </span><span class="sh">"""</span>
    <span class="c1"># 1. Build FFmpeg concat instruction file
</span>    <span class="n">concat_txt</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">concat.txt</span><span class="sh">"</span><span class="p">)</span>
    <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">concat_txt</span><span class="p">,</span> <span class="sh">"</span><span class="s">w</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="k">for</span> <span class="n">vp</span> <span class="ow">in</span> <span class="n">video_paths</span><span class="p">:</span>
            <span class="n">f</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">file </span><span class="sh">'</span><span class="si">{</span><span class="n">vp</span><span class="p">.</span><span class="nf">absolute</span><span class="p">()</span><span class="si">}</span><span class="sh">'</span><span class="se">\n</span><span class="sh">"</span><span class="p">)</span>

    <span class="c1"># 2. Transcode parameters to normalize resolutions to 1080p 16:9
</span>    <span class="n">tmp_vid</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">tmp_vid.mp4</span><span class="sh">"</span><span class="p">)</span>
    <span class="n">subprocess</span><span class="p">.</span><span class="nf">run</span><span class="p">([</span>
        <span class="sh">"</span><span class="s">ffmpeg</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-y</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-f</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">concat</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-safe</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">0</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">-i</span><span class="sh">"</span><span class="p">,</span> <span class="nf">str</span><span class="p">(</span><span class="n">concat_txt</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">-c:v</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">libx264</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-crf</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">24</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-preset</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">veryfast</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">-r</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">30</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-an</span><span class="sh">"</span><span class="p">,</span> <span class="c1"># Drop source audio track (-an)
</span>        <span class="sh">"</span><span class="s">-vf</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">scale=1920:1080:force_original_aspect_ratio=decrease,pad=1920:1080:(ow-iw)/2:(oh-ih)/2</span><span class="sh">"</span><span class="p">,</span>
        <span class="nf">str</span><span class="p">(</span><span class="n">tmp_vid</span><span class="p">)</span>
    <span class="p">],</span> <span class="n">check</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="c1"># 3. Mux audio and video tracks, truncate at shortest stream
</span>    <span class="n">subprocess</span><span class="p">.</span><span class="nf">run</span><span class="p">([</span>
        <span class="sh">"</span><span class="s">ffmpeg</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-y</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">-i</span><span class="sh">"</span><span class="p">,</span> <span class="nf">str</span><span class="p">(</span><span class="n">tmp_vid</span><span class="p">),</span> <span class="sh">"</span><span class="s">-i</span><span class="sh">"</span><span class="p">,</span> <span class="nf">str</span><span class="p">(</span><span class="n">audio_path</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">-c:v</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">copy</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-c:a</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">aac</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">-b:a</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">192k</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">-shortest</span><span class="sh">"</span><span class="p">,</span>
        <span class="nf">str</span><span class="p">(</span><span class="n">out_path</span><span class="p">)</span>
    <span class="p">],</span> <span class="n">check</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

    <span class="c1"># Teardown artifacts
</span>    <span class="n">concat_txt</span><span class="p">.</span><span class="nf">unlink</span><span class="p">()</span>
    <span class="n">tmp_vid</span><span class="p">.</span><span class="nf">unlink</span><span class="p">()</span>
</code></pre></div></div>

<p>This normalization step letterboxes or pillarboxes every clip into a 1080p 16:9 frame while preserving its original aspect ratio, so vertical TikTok footage is padded rather than stretched, and every segment leaves the transcode with identical resolution, codec, and frame rate before muxing.</p>
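<p>Since the concat demuxer assumes its inputs share codec parameters, it can be worth verifying clip geometry before concatenation. A minimal sketch (the helper names are ours, and it assumes <code class="language-plaintext highlighter-rouge">ffprobe</code> is on the <code class="language-plaintext highlighter-rouge">PATH</code>):</p>

```python
import json
import subprocess
from pathlib import Path

def probe_geometry(path: Path) -> tuple:
    """Query ffprobe for (width, height, avg_frame_rate) of the first video stream."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=width,height,avg_frame_rate",
         "-of", "json", str(path)],
        capture_output=True, text=True, check=True,
    ).stdout
    stream = json.loads(out)["streams"][0]
    return (stream["width"], stream["height"], stream["avg_frame_rate"])

def find_mismatches(profiles: dict) -> dict:
    """Return the entries in {path: geometry} that deviate from the first entry."""
    reference = next(iter(profiles.values()))
    return {p: g for p, g in profiles.items() if g != reference}
```

<p>Running <code class="language-plaintext highlighter-rouge">find_mismatches({vp: probe_geometry(vp) for vp in video_paths})</code> before the concat pass turns a silently glitchy output into an explicit failure; if mismatches surface, a safer fallback is to pre-transcode each clip individually before concatenating.</p>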

<p>In <a href="/tiktok/tiktok-youtube-part4-scheduling-publishing/">Part 4: OAuth2 authentication and YouTube Data API uploads</a>, we discuss orchestrating the final stage via the <code class="language-plaintext highlighter-rouge">google-api-python-client</code>.</p>

<blockquote>
  <p><strong>Need more TikTok data inputs?</strong> You can also feed this pipeline with trending videos, hashtag data, or user profiles. See our full collection of <a href="/tools/">TikTok and Twitter scraping tools</a> for additional ingestion sources.</p>
</blockquote>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><a href="/tiktok/tiktok-youtube-part1-searching-api/">Part 1: Interfacing with the Apify Python SDK</a></li>
    <li><a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2: Asynchronous video ingestion and connection pooling</a></li>
    <li><strong>Part 3: LLM script synthesis and FFmpeg concatenation</strong> ← You are here</li>
    <li><a href="/tiktok/tiktok-youtube-part4-scheduling-publishing/">Part 4: OAuth2 authentication and YouTube Data API uploads</a></li>
  </ul>
</blockquote>]]></content><author><name>Novi Develop</name></author><category term="Systems Engineering" /><category term="Video Automation" /><category term="Python" /><category term="openai" /><category term="ffmpeg" /><category term="python" /><category term="pipeline" /><summary type="html"><![CDATA[This is Part 3 of our pipeline series. (See the Architecture Overview for context.) Having isolated independent .mp4 chunks in our local filesystem from Part 2, we must now parse metadata via OpenAI and concatenate the sequence using FFmpeg.]]></summary></entry><entry><title type="html">Part 2: Asynchronous video ingestion and connection pooling</title><link href="https://scrapingnerd.github.io/tiktok/tiktok-youtube-part2-downloading-videos/" rel="alternate" type="text/html" title="Part 2: Asynchronous video ingestion and connection pooling" /><published>2026-03-25T08:00:00+07:00</published><updated>2026-03-25T08:00:00+07:00</updated><id>https://scrapingnerd.github.io/tiktok/tiktok-youtube-part2-downloading-videos</id><content type="html" xml:base="https://scrapingnerd.github.io/tiktok/tiktok-youtube-part2-downloading-videos/"><![CDATA[<p>This is Part 2 of our pipeline series. (See the <a href="/tiktok/youtube-news-channel-tiktok-data/">Architecture Overview</a> for context.) After normalizing the dataset payload from the <a href="https://apify.com/novi/advanced-search-tiktok-api?fpr=7hce1m">Advanced TikTok Search API Actor</a>, our next challenge is physically ingesting the <code class="language-plaintext highlighter-rouge">.mp4</code> payloads from TikTok’s CDN endpoints.</p>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><a href="/tiktok/tiktok-youtube-part1-searching-api/">Part 1: Interfacing with the Apify Python SDK</a></li>
    <li><strong>Part 2: Asynchronous video ingestion and connection pooling</strong> ← You are here</li>
    <li><a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3: LLM script synthesis and FFmpeg concatenation</a></li>
    <li><a href="/tiktok/tiktok-youtube-part4-scheduling-publishing/">Part 4: OAuth2 authentication and YouTube Data API uploads</a></li>
  </ul>
</blockquote>

<h2 id="engineering-constraints">Engineering constraints</h2>

<p>Downloading 100+ videos synchronously causes substantial I/O blocking. Conversely, unrestrained parallel requests will exhaust file descriptors and trigger remote rate limiters (resulting in <code class="language-plaintext highlighter-rouge">403 Forbidden</code> or <code class="language-plaintext highlighter-rouge">429 Too Many Requests</code> responses from the CDN). We need bounded concurrency.</p>

<p>We use <code class="language-plaintext highlighter-rouge">httpx</code> for modern async HTTP connection pooling, wrapped in an <code class="language-plaintext highlighter-rouge">asyncio.Semaphore</code> that caps the number of requests in flight at any moment.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span><span class="s2">"httpx[http2]"</span>
</code></pre></div></div>

<h2 id="bounded-concurrency-downloader-implementation">Bounded concurrency downloader implementation</h2>

<p>This script reads the normalized data-lake JSON produced in Part 1, builds a target manifest keyed on <code class="language-plaintext highlighter-rouge">aweme_id</code> (so duplicate records collapse automatically), and initiates pooled GET requests against each <code class="language-plaintext highlighter-rouge">source_url</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># ingest.py
</span>
<span class="kn">import</span> <span class="n">asyncio</span>
<span class="kn">import</span> <span class="n">json</span>
<span class="kn">import</span> <span class="n">logging</span>
<span class="kn">import</span> <span class="n">sys</span>
<span class="kn">from</span> <span class="n">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="kn">import</span> <span class="n">httpx</span>

<span class="c1"># Throttle concurrent TCP connections to the CDN
</span><span class="n">MAX_CONCURRENT_RQS</span> <span class="o">=</span> <span class="mi">8</span>
<span class="n">TIMEOUT_SEC</span> <span class="o">=</span> <span class="mf">45.0</span>
<span class="n">DATA_LAKE_PATH</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">data/raw</span><span class="sh">"</span><span class="p">)</span>
<span class="n">INGEST_PATH</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">data/media_assets</span><span class="sh">"</span><span class="p">)</span>

<span class="n">logging</span><span class="p">.</span><span class="nf">basicConfig</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="n">logging</span><span class="p">.</span><span class="n">INFO</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="sh">"</span><span class="s">%(levelname)s: %(message)s</span><span class="sh">"</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">build_manifest</span><span class="p">(</span><span class="n">input_json</span><span class="p">:</span> <span class="n">Path</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Parses extract data and yields a deduplicated manifest dict.</span><span class="sh">"""</span>
    <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">input_json</span><span class="p">,</span> <span class="sh">"</span><span class="s">r</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">records</span> <span class="o">=</span> <span class="n">json</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>

    <span class="n">manifest</span> <span class="o">=</span> <span class="p">{}</span>
    <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">records</span><span class="p">:</span>
        <span class="n">uid</span> <span class="o">=</span> <span class="n">r</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">aweme_id</span><span class="sh">"</span><span class="p">)</span>
        <span class="n">source_url</span> <span class="o">=</span> <span class="n">r</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">media</span><span class="sh">"</span><span class="p">,</span> <span class="p">{}).</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">source_url</span><span class="sh">"</span><span class="p">)</span>
        
        <span class="c1"># Guard against malformed records
</span>        <span class="k">if</span> <span class="ow">not</span> <span class="n">uid</span> <span class="ow">or</span> <span class="ow">not</span> <span class="n">source_url</span><span class="p">:</span>
            <span class="k">continue</span>
            
        <span class="n">manifest</span><span class="p">[</span><span class="n">uid</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
            <span class="sh">"</span><span class="s">url</span><span class="sh">"</span><span class="p">:</span> <span class="n">source_url</span><span class="p">,</span>
            <span class="sh">"</span><span class="s">topic</span><span class="sh">"</span><span class="p">:</span> <span class="n">r</span><span class="p">[</span><span class="sh">"</span><span class="s">query_context</span><span class="sh">"</span><span class="p">].</span><span class="nf">replace</span><span class="p">(</span><span class="sh">"</span><span class="s"> </span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">_</span><span class="sh">"</span><span class="p">),</span>
            <span class="sh">"</span><span class="s">target_dir</span><span class="sh">"</span><span class="p">:</span> <span class="n">INGEST_PATH</span> <span class="o">/</span> <span class="n">r</span><span class="p">[</span><span class="sh">"</span><span class="s">query_context</span><span class="sh">"</span><span class="p">].</span><span class="nf">replace</span><span class="p">(</span><span class="sh">"</span><span class="s"> </span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">_</span><span class="sh">"</span><span class="p">),</span>
            <span class="sh">"</span><span class="s">target_filename</span><span class="sh">"</span><span class="p">:</span> <span class="sa">f</span><span class="sh">"</span><span class="si">{</span><span class="n">uid</span><span class="si">}</span><span class="s">.mp4</span><span class="sh">"</span>
        <span class="p">}</span>
            
    <span class="k">return</span> <span class="n">manifest</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">stream_to_disk</span><span class="p">(</span>
    <span class="n">client</span><span class="p">:</span> <span class="n">httpx</span><span class="p">.</span><span class="n">AsyncClient</span><span class="p">,</span> 
    <span class="n">semaphore</span><span class="p">:</span> <span class="n">asyncio</span><span class="p">.</span><span class="n">Semaphore</span><span class="p">,</span> 
    <span class="n">item_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> 
    <span class="n">job</span><span class="p">:</span> <span class="nb">dict</span>
<span class="p">):</span>
    <span class="sh">"""</span><span class="s">Downloads a video stream using bounded concurrency and exponential backoff.</span><span class="sh">"""</span>
    <span class="n">job</span><span class="p">[</span><span class="sh">"</span><span class="s">target_dir</span><span class="sh">"</span><span class="p">].</span><span class="nf">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">out_file</span> <span class="o">=</span> <span class="n">job</span><span class="p">[</span><span class="sh">"</span><span class="s">target_dir</span><span class="sh">"</span><span class="p">]</span> <span class="o">/</span> <span class="n">job</span><span class="p">[</span><span class="sh">"</span><span class="s">target_filename</span><span class="sh">"</span><span class="p">]</span>
    
    <span class="c1"># Idempotent execution
</span>    <span class="k">if</span> <span class="n">out_file</span><span class="p">.</span><span class="nf">exists</span><span class="p">():</span>
        <span class="n">logging</span><span class="p">.</span><span class="nf">debug</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">[</span><span class="si">{</span><span class="n">item_id</span><span class="si">}</span><span class="s">] Cache hit. Skipping.</span><span class="sh">"</span><span class="p">)</span>
        <span class="k">return</span> <span class="sh">"</span><span class="s">cache</span><span class="sh">"</span>

    <span class="k">async</span> <span class="k">with</span> <span class="n">semaphore</span><span class="p">:</span>
        <span class="k">for</span> <span class="n">attempt</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="mi">3</span><span class="p">):</span>
            <span class="k">try</span><span class="p">:</span>
                <span class="c1"># TikTok CDNs require a realistic User-Agent
</span>                <span class="n">headers</span> <span class="o">=</span> <span class="p">{</span>
                    <span class="sh">"</span><span class="s">User-Agent</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36</span><span class="sh">"</span><span class="p">,</span>
                    <span class="sh">"</span><span class="s">Referer</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">https://www.tiktok.com/</span><span class="sh">"</span>
                <span class="p">}</span>
                
                <span class="k">async</span> <span class="k">with</span> <span class="n">client</span><span class="p">.</span><span class="nf">stream</span><span class="p">(</span><span class="sh">"</span><span class="s">GET</span><span class="sh">"</span><span class="p">,</span> <span class="n">job</span><span class="p">[</span><span class="sh">"</span><span class="s">url</span><span class="sh">"</span><span class="p">],</span> <span class="n">headers</span><span class="o">=</span><span class="n">headers</span><span class="p">,</span> <span class="n">timeout</span><span class="o">=</span><span class="n">TIMEOUT_SEC</span><span class="p">)</span> <span class="k">as</span> <span class="n">resp</span><span class="p">:</span>
                    <span class="n">resp</span><span class="p">.</span><span class="nf">raise_for_status</span><span class="p">()</span>
                    
                    <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">out_file</span><span class="p">,</span> <span class="sh">"</span><span class="s">wb</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
                        <span class="k">async</span> <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">resp</span><span class="p">.</span><span class="nf">aiter_bytes</span><span class="p">(</span><span class="n">chunk_size</span><span class="o">=</span><span class="mi">1024</span> <span class="o">*</span> <span class="mi">128</span><span class="p">):</span> <span class="c1"># 128KB chunks
</span>                            <span class="n">f</span><span class="p">.</span><span class="nf">write</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span>
                            
                <span class="k">return</span> <span class="sh">"</span><span class="s">ok</span><span class="sh">"</span>
            <span class="nf">except </span><span class="p">(</span><span class="n">httpx</span><span class="p">.</span><span class="n">HTTPError</span><span class="p">,</span> <span class="n">httpx</span><span class="p">.</span><span class="n">TimeoutException</span><span class="p">)</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
                <span class="n">logging</span><span class="p">.</span><span class="nf">warning</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">[</span><span class="si">{</span><span class="n">item_id</span><span class="si">}</span><span class="s">] Network fault </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">. Retry </span><span class="si">{</span><span class="n">attempt</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">/3</span><span class="sh">"</span><span class="p">)</span>
                <span class="n">out_file</span><span class="p">.</span><span class="nf">unlink</span><span class="p">(</span><span class="n">missing_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>  <span class="c1"># Discard partial writes so the idempotency check stays valid
</span>                <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="nf">sleep</span><span class="p">(</span><span class="mi">2</span> <span class="o">**</span> <span class="n">attempt</span><span class="p">)</span>  <span class="c1"># Exponential backoff
</span>        
    <span class="k">return</span> <span class="sh">"</span><span class="s">failed</span><span class="sh">"</span>

<span class="k">async</span> <span class="k">def</span> <span class="nf">orchestrate_ingestion</span><span class="p">(</span><span class="n">manifest</span><span class="p">:</span> <span class="nb">dict</span><span class="p">):</span>
    <span class="c1"># Enforces absolute limit on active connections
</span>    <span class="n">semaphore</span> <span class="o">=</span> <span class="n">asyncio</span><span class="p">.</span><span class="nc">Semaphore</span><span class="p">(</span><span class="n">MAX_CONCURRENT_RQS</span><span class="p">)</span>
    
    <span class="c1"># httpx.AsyncClient maintains a persistent connection pool (Keep-Alive)
</span>    <span class="k">async</span> <span class="k">with</span> <span class="n">httpx</span><span class="p">.</span><span class="nc">AsyncClient</span><span class="p">(</span><span class="n">http2</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">follow_redirects</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">client</span><span class="p">:</span>
        <span class="n">tasks</span> <span class="o">=</span> <span class="p">[</span>
            <span class="nf">stream_to_disk</span><span class="p">(</span><span class="n">client</span><span class="p">,</span> <span class="n">semaphore</span><span class="p">,</span> <span class="n">uid</span><span class="p">,</span> <span class="n">job</span><span class="p">)</span>
            <span class="k">for</span> <span class="n">uid</span><span class="p">,</span> <span class="n">job</span> <span class="ow">in</span> <span class="n">manifest</span><span class="p">.</span><span class="nf">items</span><span class="p">()</span>
        <span class="p">]</span>
        
        <span class="c1"># Await completion of all ingestion coroutines
</span>        <span class="n">results</span> <span class="o">=</span> <span class="k">await</span> <span class="n">asyncio</span><span class="p">.</span><span class="nf">gather</span><span class="p">(</span><span class="o">*</span><span class="n">tasks</span><span class="p">)</span>
        
    <span class="n">oks</span> <span class="o">=</span> <span class="n">results</span><span class="p">.</span><span class="nf">count</span><span class="p">(</span><span class="sh">"</span><span class="s">ok</span><span class="sh">"</span><span class="p">)</span>
    <span class="n">caches</span> <span class="o">=</span> <span class="n">results</span><span class="p">.</span><span class="nf">count</span><span class="p">(</span><span class="sh">"</span><span class="s">cache</span><span class="sh">"</span><span class="p">)</span>
    <span class="n">fails</span> <span class="o">=</span> <span class="n">results</span><span class="p">.</span><span class="nf">count</span><span class="p">(</span><span class="sh">"</span><span class="s">failed</span><span class="sh">"</span><span class="p">)</span>
    <span class="n">logging</span><span class="p">.</span><span class="nf">info</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Ingestion complete: OK=</span><span class="si">{</span><span class="n">oks</span><span class="si">}</span><span class="s">, Cached=</span><span class="si">{</span><span class="n">caches</span><span class="si">}</span><span class="s">, Failed=</span><span class="si">{</span><span class="n">fails</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="sh">"</span><span class="s">__main__</span><span class="sh">"</span><span class="p">:</span>
    <span class="n">extracts</span> <span class="o">=</span> <span class="nf">sorted</span><span class="p">(</span><span class="n">DATA_LAKE_PATH</span><span class="p">.</span><span class="nf">glob</span><span class="p">(</span><span class="sh">"</span><span class="s">extract_*.json</span><span class="sh">"</span><span class="p">))</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">extracts</span><span class="p">:</span>
        <span class="n">sys</span><span class="p">.</span><span class="nf">exit</span><span class="p">(</span><span class="sh">"</span><span class="s">No extract_*.json files in data/raw; run Part 1 first.</span><span class="sh">"</span><span class="p">)</span>
    <span class="n">latest_extract</span> <span class="o">=</span> <span class="n">extracts</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
    <span class="n">logging</span><span class="p">.</span><span class="nf">info</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Mounting manifest from </span><span class="si">{</span><span class="n">latest_extract</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
    
    <span class="n">manifest</span> <span class="o">=</span> <span class="nf">build_manifest</span><span class="p">(</span><span class="n">latest_extract</span><span class="p">)</span>
    <span class="n">asyncio</span><span class="p">.</span><span class="nf">run</span><span class="p">(</span><span class="nf">orchestrate_ingestion</span><span class="p">(</span><span class="n">manifest</span><span class="p">))</span>
</code></pre></div></div>

<h2 id="infrastructure-design-notes">Infrastructure design notes</h2>

<ul>
  <li><strong>Chunked Writes (<code class="language-plaintext highlighter-rouge">aiter_bytes</code>)</strong>: Buffering an entire ~50MB response in memory scales poorly. Streaming 128KB chunks straight to disk keeps memory usage flat regardless of file size.</li>
  <li><strong>Connection Pooling</strong>: Reusing TLS connections to the same host over <code class="language-plaintext highlighter-rouge">http2</code> avoids repeated handshakes, substantially outperforming naive <code class="language-plaintext highlighter-rouge">requests.get()</code> calls inside a <code class="language-plaintext highlighter-rouge">ThreadPoolExecutor</code>.</li>
  <li><strong>Deduplication</strong>: Because the manifest dictionary is keyed on <code class="language-plaintext highlighter-rouge">aweme_id</code>, multiple query strategies that surface the same video are deduplicated for free.</li>
</ul>
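<p>That deduplication property falls out of ordinary dict semantics: a later record with the same <code class="language-plaintext highlighter-rouge">aweme_id</code> simply overwrites the earlier one. A self-contained illustration with fabricated records (the IDs and URLs are placeholders mirroring the manifest builder's record shape):</p>

```python
def dedupe_by_aweme_id(records: list) -> dict:
    """Collapse overlapping query results into one job per aweme_id."""
    manifest = {}
    for r in records:
        uid = r.get("aweme_id")
        url = r.get("media", {}).get("source_url")
        if not uid or not url:
            continue  # skip malformed records
        manifest[uid] = {"url": url}  # later duplicates overwrite, not append
    return manifest

records = [
    {"aweme_id": "701", "media": {"source_url": "https://cdn.example/701.mp4"}},
    {"aweme_id": "702", "media": {"source_url": "https://cdn.example/702.mp4"}},
    # Same video surfaced by a second keyword query:
    {"aweme_id": "701", "media": {"source_url": "https://cdn.example/701.mp4"}},
    {"aweme_id": None, "media": {"source_url": "https://cdn.example/x.mp4"}},
]
print(len(dedupe_by_aweme_id(records)))  # → 2
```

<p>Four input records collapse to two download jobs, and the malformed record is dropped by the guard rather than crashing the run.</p>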

<p>In <a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3: LLM script synthesis and FFmpeg concatenation</a>, we will utilize <code class="language-plaintext highlighter-rouge">ffmpeg-python</code> bindings and ChatGPT to assemble these raw <code class="language-plaintext highlighter-rouge">.mp4</code> chunks into a single chronological presentation.</p>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><a href="/tiktok/tiktok-youtube-part1-searching-api/">Part 1: Interfacing with the Apify Python SDK</a></li>
    <li><strong>Part 2: Asynchronous video ingestion and connection pooling</strong> ← You are here</li>
    <li><a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3: LLM script synthesis and FFmpeg concatenation</a></li>
    <li><a href="/tiktok/tiktok-youtube-part4-scheduling-publishing/">Part 4: OAuth2 authentication and YouTube Data API uploads</a></li>
  </ul>
</blockquote>]]></content><author><name>Novi Develop</name></author><category term="Systems Engineering" /><category term="Video Automation" /><category term="Python" /><category term="apify" /><category term="httpx" /><category term="asyncio" /><category term="python" /><summary type="html"><![CDATA[This is Part 2 of our pipeline series. (See the Architecture Overview for context.) After normalizing the dataset payload from the Advanced TikTok Search API Actor, our next challenge is physically ingesting the .mp4 payloads from TikTok’s CDN endpoints.]]></summary></entry><entry><title type="html">Part 1: Interfacing with the Apify Python SDK for TikTok Extraction</title><link href="https://scrapingnerd.github.io/tiktok/tiktok-youtube-part1-searching-api/" rel="alternate" type="text/html" title="Part 1: Interfacing with the Apify Python SDK for TikTok Extraction" /><published>2026-03-24T08:00:00+07:00</published><updated>2026-03-24T08:00:00+07:00</updated><id>https://scrapingnerd.github.io/tiktok/tiktok-youtube-part1-searching-api</id><content type="html" xml:base="https://scrapingnerd.github.io/tiktok/tiktok-youtube-part1-searching-api/"><![CDATA[<p>This is Part 1 of our pipeline architecture series. (See the <a href="/tiktok/youtube-news-channel-tiktok-data/">Architecture Overview</a> for context.) Here, we implement the extraction layer using the <code class="language-plaintext highlighter-rouge">apify-client</code> SDK to interface with the <a href="https://apify.com/novi/advanced-search-tiktok-api?fpr=7hce1m">Advanced TikTok Search API Actor</a>.</p>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><strong>Part 1: Interfacing with the Apify Python SDK</strong> ← You are here</li>
    <li><a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2: Asynchronous video ingestion and connection pooling</a></li>
    <li><a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3: LLM script synthesis and FFmpeg concatenation</a></li>
    <li><a href="/tiktok/tiktok-youtube-part4-scheduling-publishing/">Part 4: OAuth2 authentication and YouTube Data API uploads</a></li>
  </ul>
</blockquote>

<h2 id="infrastructure-dependencies">Infrastructure dependencies</h2>

<p>We rely on the official Apify SDK to manage API requests and dataset pagination against Apify’s infrastructure.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>apify-client python-dotenv
</code></pre></div></div>

<p>Store your <code class="language-plaintext highlighter-rouge">APIFY_TOKEN</code> securely in environment variables or a <code class="language-plaintext highlighter-rouge">.env</code> file. Never hardcode access tokens in version control.</p>

<h2 id="schema-configurations-and-targeted-queries">Schema configurations and targeted queries</h2>

<p>The actor strictly defines an input schema. When extracting data for news events, geographical relevance (<code class="language-plaintext highlighter-rouge">region</code>) and recency (<code class="language-plaintext highlighter-rouge">publishTime</code>) are critical query parameters.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># config.py
</span>
<span class="kn">import</span> <span class="n">os</span>
<span class="kn">from</span> <span class="n">dotenv</span> <span class="kn">import</span> <span class="n">load_dotenv</span>

<span class="nf">load_dotenv</span><span class="p">()</span>

<span class="n">APIFY_TOKEN</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">APIFY_TOKEN</span><span class="sh">"</span><span class="p">)</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">APIFY_TOKEN</span><span class="p">:</span>
    <span class="k">raise</span> <span class="nc">EnvironmentError</span><span class="p">(</span><span class="sh">"</span><span class="s">APIFY_TOKEN is not set; define it in the environment or a .env file.</span><span class="sh">"</span><span class="p">)</span>

<span class="c1"># Input schema matching the Advanced TikTok Search API
</span><span class="n">SEARCH_PAYLOADS</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span>
        <span class="sh">"</span><span class="s">keyword</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Ukraine frontline</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">region</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">UA</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">sortType</span><span class="sh">"</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>       <span class="c1"># 2 maps to "Most recent"
</span>        <span class="sh">"</span><span class="s">publishTime</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">YESTERDAY</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">limit</span><span class="sh">"</span><span class="p">:</span> <span class="mi">50</span><span class="p">,</span>
    <span class="p">},</span>
    <span class="p">{</span>
        <span class="sh">"</span><span class="s">keyword</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">France protests</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">region</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">FR</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">sortType</span><span class="sh">"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>       <span class="c1"># 1 maps to "Most liked"
</span>        <span class="sh">"</span><span class="s">publishTime</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">WEEK</span><span class="sh">"</span><span class="p">,</span>
        <span class="sh">"</span><span class="s">limit</span><span class="sh">"</span><span class="p">:</span> <span class="mi">30</span><span class="p">,</span>
    <span class="p">}</span>
<span class="p">]</span>
</code></pre></div></div>
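<p>Because a malformed payload only fails after an actor run has already started (and been billed), a cheap pre-flight check before dispatch can pay for itself. The sketch below is a hypothetical helper, not part of the Apify SDK; the required keys and <code class="language-plaintext highlighter-rouge">sortType</code> values mirror the <code class="language-plaintext highlighter-rouge">SEARCH_PAYLOADS</code> entries above:</p>

```python
# Hypothetical pre-flight validation for SEARCH_PAYLOADS entries.
REQUIRED_KEYS = {"keyword", "region", "sortType", "publishTime", "limit"}
ALLOWED_SORT_TYPES = {1, 2}  # 1 = "Most liked", 2 = "Most recent"

def validate_payloads(payloads):
    """Return a list of human-readable problems; empty means all payloads pass."""
    problems = []
    for i, payload in enumerate(payloads):
        missing = REQUIRED_KEYS - payload.keys()
        if missing:
            problems.append(f"payload {i}: missing keys {sorted(missing)}")
        if payload.get("sortType") not in ALLOWED_SORT_TYPES:
            problems.append(f"payload {i}: unsupported sortType {payload.get('sortType')!r}")
        limit = payload.get("limit")
        if not isinstance(limit, int) or limit <= 0:
            problems.append(f"payload {i}: limit must be a positive integer")
    return problems
```

<p>Calling <code class="language-plaintext highlighter-rouge">validate_payloads(SEARCH_PAYLOADS)</code> at startup turns schema drift into an immediate local error instead of a wasted actor run.</p>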

<h2 id="executing-the-extraction-pipeline">Executing the extraction pipeline</h2>

<p>To decouple raw actor responses from our pipeline’s expected schema, we implement a normalizer (<code class="language-plaintext highlighter-rouge">normalize_payload</code>) that extracts only the fields required downstream (e.g., <code class="language-plaintext highlighter-rouge">aweme_id</code>, <code class="language-plaintext highlighter-rouge">desc</code>, and the first <code class="language-plaintext highlighter-rouge">play_addr</code> CDN URL). Storing the entire raw payload from the actor wastes disk space and database I/O bandwidth.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># extract.py
</span>
<span class="kn">import</span> <span class="n">json</span>
<span class="kn">from</span> <span class="n">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>
<span class="kn">from</span> <span class="n">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>

<span class="kn">from</span> <span class="n">apify_client</span> <span class="kn">import</span> <span class="n">ApifyClient</span>
<span class="kn">from</span> <span class="n">config</span> <span class="kn">import</span> <span class="n">APIFY_TOKEN</span><span class="p">,</span> <span class="n">SEARCH_PAYLOADS</span>

<span class="n">ACTOR_ID</span> <span class="o">=</span> <span class="sh">"</span><span class="s">novi/advanced-search-tiktok-api</span><span class="sh">"</span>
<span class="n">DATA_LAKE_PATH</span> <span class="o">=</span> <span class="nc">Path</span><span class="p">(</span><span class="sh">"</span><span class="s">data/raw</span><span class="sh">"</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">normalize_payload</span><span class="p">(</span><span class="n">raw_item</span><span class="p">:</span> <span class="nb">dict</span><span class="p">,</span> <span class="n">original_query</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Normalizes the raw JSON schema into a consistent pipeline format.</span><span class="sh">"""</span>
    <span class="n">stats</span> <span class="o">=</span> <span class="n">raw_item</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">statistics</span><span class="sh">"</span><span class="p">,</span> <span class="p">{})</span>
    <span class="n">play_addr</span> <span class="o">=</span> <span class="n">raw_item</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">video</span><span class="sh">"</span><span class="p">,</span> <span class="p">{}).</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">play_addr</span><span class="sh">"</span><span class="p">,</span> <span class="p">{})</span>

    <span class="k">return</span> <span class="p">{</span>
        <span class="sh">"</span><span class="s">aweme_id</span><span class="sh">"</span><span class="p">:</span> <span class="n">raw_item</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">aweme_id</span><span class="sh">"</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">description</span><span class="sh">"</span><span class="p">:</span> <span class="n">raw_item</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">desc</span><span class="sh">"</span><span class="p">,</span> <span class="sh">""</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">region</span><span class="sh">"</span><span class="p">:</span> <span class="n">raw_item</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">region</span><span class="sh">"</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">timestamp_unix</span><span class="sh">"</span><span class="p">:</span> <span class="n">raw_item</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">create_time</span><span class="sh">"</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">metrics</span><span class="sh">"</span><span class="p">:</span> <span class="p">{</span>
            <span class="sh">"</span><span class="s">views</span><span class="sh">"</span><span class="p">:</span> <span class="n">stats</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">play_count</span><span class="sh">"</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span>
            <span class="sh">"</span><span class="s">likes</span><span class="sh">"</span><span class="p">:</span> <span class="n">stats</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">digg_count</span><span class="sh">"</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span>
            <span class="sh">"</span><span class="s">shares</span><span class="sh">"</span><span class="p">:</span> <span class="n">stats</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">share_count</span><span class="sh">"</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span>
        <span class="p">},</span>
        <span class="sh">"</span><span class="s">media</span><span class="sh">"</span><span class="p">:</span> <span class="p">{</span>
            <span class="c1"># Extract the first available CDN URL; prioritizing the watermark-free address
</span>            <span class="sh">"</span><span class="s">source_url</span><span class="sh">"</span><span class="p">:</span> <span class="n">play_addr</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">url_list</span><span class="sh">"</span><span class="p">,</span> <span class="p">[</span><span class="bp">None</span><span class="p">])[</span><span class="mi">0</span><span class="p">],</span>
            <span class="sh">"</span><span class="s">duration_ms</span><span class="sh">"</span><span class="p">:</span> <span class="n">raw_item</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">video</span><span class="sh">"</span><span class="p">,</span> <span class="p">{}).</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">duration</span><span class="sh">"</span><span class="p">),</span>
        <span class="p">},</span>
        <span class="sh">"</span><span class="s">query_context</span><span class="sh">"</span><span class="p">:</span> <span class="n">original_query</span>
    <span class="p">}</span>

<span class="k">def</span> <span class="nf">fetch_datasets</span><span class="p">():</span>
    <span class="n">client</span> <span class="o">=</span> <span class="nc">ApifyClient</span><span class="p">(</span><span class="n">APIFY_TOKEN</span><span class="p">)</span>
    <span class="n">normalized_results</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">for</span> <span class="n">payload</span> <span class="ow">in</span> <span class="n">SEARCH_PAYLOADS</span><span class="p">:</span>
        <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Triggering execution for: </span><span class="si">{</span><span class="n">payload</span><span class="p">[</span><span class="sh">'</span><span class="s">keyword</span><span class="sh">'</span><span class="p">]</span><span class="si">}</span><span class="sh">"</span><span class="p">)</span>
        
        <span class="c1"># Blocking call: waits for the actor to complete execution on Apify
</span>        <span class="n">run</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="nf">actor</span><span class="p">(</span><span class="n">ACTOR_ID</span><span class="p">).</span><span class="nf">call</span><span class="p">(</span><span class="n">run_input</span><span class="o">=</span><span class="n">payload</span><span class="p">)</span>
        
        <span class="c1"># Paginate through the associated dataset
</span>        <span class="n">items_iterator</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="nf">dataset</span><span class="p">(</span><span class="n">run</span><span class="p">[</span><span class="sh">"</span><span class="s">defaultDatasetId</span><span class="sh">"</span><span class="p">]).</span><span class="nf">iterate_items</span><span class="p">()</span>
        
        <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">items_iterator</span><span class="p">:</span>
            <span class="n">normalized_results</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="nf">normalize_payload</span><span class="p">(</span><span class="n">item</span><span class="p">,</span> <span class="n">payload</span><span class="p">[</span><span class="sh">'</span><span class="s">keyword</span><span class="sh">'</span><span class="p">]))</span>
            
    <span class="k">return</span> <span class="n">normalized_results</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="sh">"</span><span class="s">__main__</span><span class="sh">"</span><span class="p">:</span>
    <span class="n">results</span> <span class="o">=</span> <span class="nf">fetch_datasets</span><span class="p">()</span>
    
    <span class="n">DATA_LAKE_PATH</span><span class="p">.</span><span class="nf">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">out_file</span> <span class="o">=</span> <span class="n">DATA_LAKE_PATH</span> <span class="o">/</span> <span class="sa">f</span><span class="sh">"</span><span class="s">extract_</span><span class="si">{</span><span class="n">datetime</span><span class="p">.</span><span class="nf">now</span><span class="p">().</span><span class="nf">strftime</span><span class="p">(</span><span class="sh">'</span><span class="s">%Y%m%d%H%M</span><span class="sh">'</span><span class="p">)</span><span class="si">}</span><span class="s">.json</span><span class="sh">"</span>
    
    <span class="k">with</span> <span class="nf">open</span><span class="p">(</span><span class="n">out_file</span><span class="p">,</span> <span class="sh">"</span><span class="s">w</span><span class="sh">"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">json</span><span class="p">.</span><span class="nf">dump</span><span class="p">(</span><span class="n">results</span><span class="p">,</span> <span class="n">f</span><span class="p">,</span> <span class="n">separators</span><span class="o">=</span><span class="p">(</span><span class="sh">'</span><span class="s">,</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">:</span><span class="sh">'</span><span class="p">))</span> <span class="c1"># Minified JSON
</span>        
    <span class="nf">print</span><span class="p">(</span><span class="sa">f</span><span class="sh">"</span><span class="s">Extracted </span><span class="si">{</span><span class="nf">len</span><span class="p">(</span><span class="n">results</span><span class="p">)</span><span class="si">}</span><span class="s"> records to local storage.</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>
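<p>As written, <code class="language-plaintext highlighter-rouge">fetch_datasets</code> is all-or-nothing: a transient actor failure on one keyword aborts the entire loop. A hedged sketch of per-payload fault isolation, using a small retry helper (the helper name and backoff parameters are our own, not part of the Apify SDK):</p>

```python
import time

def run_with_retries(call, max_attempts=3, base_delay=1.0):
    """Retry a zero-argument callable with exponential backoff.

    Re-raises the last exception once max_attempts is exhausted, so the
    caller can decide whether to skip the payload or abort the run.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
```

<p>Inside the loop you would wrap the blocking call, e.g. <code class="language-plaintext highlighter-rouge">run = run_with_retries(lambda: client.actor(ACTOR_ID).call(run_input=payload))</code>, and on final failure log the keyword and <code class="language-plaintext highlighter-rouge">continue</code> to the next payload rather than losing the whole batch.</p>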

<p>By explicitly flattening nested hierarchies into <code class="language-plaintext highlighter-rouge">metrics</code> and <code class="language-plaintext highlighter-rouge">media</code> dictionaries, we decouple the actor’s API shape from our internal domain model.</p>
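<p>The defensive <code class="language-plaintext highlighter-rouge">.get</code> chaining is what makes that decoupling safe: when the upstream payload arrives sparse or slightly reshaped, missing levels degrade to empty containers and the field becomes <code class="language-plaintext highlighter-rouge">None</code> instead of raising <code class="language-plaintext highlighter-rouge">KeyError</code> mid-run. A minimal sketch of the pattern behind <code class="language-plaintext highlighter-rouge">source_url</code>, hardened with <code class="language-plaintext highlighter-rouge">or [None]</code> so a present-but-empty <code class="language-plaintext highlighter-rouge">url_list</code> also yields <code class="language-plaintext highlighter-rouge">None</code> rather than an <code class="language-plaintext highlighter-rouge">IndexError</code>:</p>

```python
def first_cdn_url(raw_item: dict):
    """Mirror the nested lookup for media.source_url: every level defaults
    to an empty container, so a sparse payload yields None, never KeyError."""
    play_addr = raw_item.get("video", {}).get("play_addr", {})
    # `or [None]` also covers url_list being present but empty.
    return (play_addr.get("url_list") or [None])[0]
```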

<p>In <a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2: Asynchronous video ingestion and connection pooling</a>, we write a concurrent parser using <code class="language-plaintext highlighter-rouge">asyncio</code> and <code class="language-plaintext highlighter-rouge">httpx</code> to actually fetch the <code class="language-plaintext highlighter-rouge">.mp4</code> payloads from the retrieved CDN URLs.</p>

<blockquote>
  <p><strong>Series Navigation</strong></p>
  <ul>
    <li><strong>Part 1: Interfacing with the Apify Python SDK</strong> ← You are here</li>
    <li><a href="/tiktok/tiktok-youtube-part2-downloading-videos/">Part 2: Asynchronous video ingestion and connection pooling</a></li>
    <li><a href="/tiktok/tiktok-youtube-part3-ai-narration/">Part 3: LLM script synthesis and FFmpeg concatenation</a></li>
    <li><a href="/tiktok/tiktok-youtube-part4-scheduling-publishing/">Part 4: OAuth2 authentication and YouTube Data API uploads</a></li>
  </ul>
</blockquote>]]></content><author><name>Novi Develop</name></author><category term="Systems Engineering" /><category term="Video Automation" /><category term="Python" /><category term="apify" /><category term="tiktok-api" /><category term="python" /><category term="etl" /><summary type="html"><![CDATA[This is Part 1 of our pipeline architecture series. (See the Architecture Overview for context.) Here, we implement the extraction layer using the apify-client SDK to interface with the Advanced TikTok Search API Actor.]]></summary></entry></feed>