Karim shoair committed
Commit · eca1c09
Parent(s): 8d5cc87
docs: update replacing AI article

Files changed:
- docs/fetching/stealthy.md +1 -1
- docs/tutorials/replacing_ai.md +11 -11
docs/fetching/stealthy.md CHANGED

@@ -66,7 +66,7 @@ Scrapling provides many options with this fetcher and its session classes. Befor
  | solve_cloudflare | When enabled, fetcher solves all types of Cloudflare's Turnstile/Interstitial challenges before returning the response to you. | ✔️ |
  | block_webrtc | Forces WebRTC to respect proxy settings to prevent local IP address leak. | ✔️ |
  | hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | ✔️ |
- | allow_webgl | Enabled by default. Disabling it disables WebGL and WebGL 2.0 support entirely. Disabling WebGL is not recommended, as many WAFs now check if WebGL is enabled.
+ | allow_webgl | Enabled by default. Disabling it disables WebGL and WebGL 2.0 support entirely. Disabling WebGL is not recommended, as many WAFs now check if WebGL is enabled. | ✔️ |
  | additional_args | Additional arguments to be passed to Playwright's context as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
  | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
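As a rough illustration of how the options in the table above combine: a minimal sketch only — the parameter names mirror the docs table, but treat the exact `StealthyFetcher.fetch` call signature as an assumption, and the fetch itself is left commented out because it needs a browser and network access.

```python
# Sketch: stealth options named after the table in docs/fetching/stealthy.md.
stealth_options = {
    "solve_cloudflare": True,   # clear Turnstile/Interstitial challenges first
    "block_webrtc": True,       # make WebRTC respect the proxy (no local IP leak)
    "hide_canvas": True,        # add random noise to canvas operations
    "allow_webgl": True,        # keep WebGL enabled; many WAFs check for it
    "additional_args": {},      # extra settings forwarded to Playwright's context
    "selector_config": {},      # parsing args for the final Selector/Response
}

# Assumed usage (not run here):
# from scrapling.fetchers import StealthyFetcher
# page = StealthyFetcher.fetch("https://example.com", **stealth_options)
```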
docs/tutorials/replacing_ai.md CHANGED

@@ -26,13 +26,13 @@ How will you solve that manually? I'm referring to generic web scraping of vario

## AI to the rescue, but at a high cost

- Of course,
+ Of course, AI can easily solve most of these issues because it can understand the page source and identify the fields you want or create selectors for them. That's, of course, if you already solved the anti-bot measures through other tools :)

This approach is, of course, beautiful. I love AI and find it very fascinating, especially Generative AI. You will probably spend a lot of time on prompt engineering and tweaking the prompts, but if that's cool with you, you will soon hit the real issue with using AI here.

Most websites have vast amounts of content per page, which you will need to pass to the AI somehow so it can do its magic. This will burn through tokens like fire in a haystack, quickly accumulating high costs.

- Unless money is irrelevant to you, you will try to find less expensive approaches, and that's
+ Unless money is irrelevant to you, you will try to find less expensive approaches, and that's where Scrapling comes into play :smile:

## Scrapling got you covered

@@ -41,11 +41,11 @@ Scrapling can handle almost all issues you will face during Web Scraping, and th

### Solving issue T1: Rapidly changing website structures
That's why the [adaptive](https://scrapling.readthedocs.io/en/latest/parsing/adaptive/) feature was made. You knew I would talk about it, and here we are :)

- While Web Scraping, if you have the `adaptive` feature enabled, you can save any element's unique properties
+ While Web Scraping, if you have the `adaptive` feature enabled, you can save any element's unique properties so you can find it again later when the website's structure changes. The most frustrating thing about changes is that anything about an element can change, so there's nothing to rely on.

That's how the adaptive feature works: it stores everything unique about an element. When the website structure changes, it returns the element with the highest similarity score of the previous element.

- I have already explained
+ I have already explained this in more detail, with many examples. Read more from [here](https://scrapling.readthedocs.io/en/latest/parsing/adaptive/#how-the-adaptive-feature-works).

### Solving issue T2: Unstable selectors
If you have been doing Web scraping for a long enough time, you have likely experienced this once. I'm referring to a website that employs poor design patterns, built on raw HTML without any IDs/classes, or uses random class names with nothing else to rely on, etc...

@@ -59,16 +59,16 @@ In these cases, standard selection methods with CSS/XPath selectors won't be opt

There is no need to explain any of these; click on the links, and it will be clear how Scrapling solves this.

### Solving issue T3: Increasingly complex anti-bot measures
- It's known that
+ It's well known that creating an undetectable spider requires more than residential/mobile proxies and human-like behavior. It also needs a hard-to-detect browser, which Scrapling provides two main options to solve:

- 1. [DynamicFetcher](https://scrapling.readthedocs.io/en/latest/fetching/dynamic/) — This fetcher provides
+ 1. [DynamicFetcher](https://scrapling.readthedocs.io/en/latest/fetching/dynamic/) — This fetcher provides flexible browser automation with multiple configuration options and little under-the-hood stealth improvements.
- 2. [StealthyFetcher](https://scrapling.readthedocs.io/en/latest/fetching/stealthy/) — Because we live in a harsh world and you need to take [full measure instead of half-measures](https://www.youtube.com/watch?v=7BE4QcwX4dU), `StealthyFetcher` was born. This fetcher
+ 2. [StealthyFetcher](https://scrapling.readthedocs.io/en/latest/fetching/stealthy/) — Because we live in a harsh world and you need to take [full measure instead of half-measures](https://www.youtube.com/watch?v=7BE4QcwX4dU), `StealthyFetcher` was born. This fetcher uses our stealthy browser -- a version of [DynamicFetcher](https://scrapling.readthedocs.io/en/latest/fetching/dynamic/) that nearly bypasses all annoying anti-protections, provides tools to handle the rest, and automatically bypasses all types of Cloudflare's Turnstile/Interstitial!

-
+ We keep improving these two with each update, so stay tuned :)

### Solving issues B1 & B2: Extreme Website Diversity / Identifying Relevant Data

- This one is tough to handle, but
+ This one is tough to handle, but Scrapling's flexibility makes it possible.

I talked with someone who uses AI to extract prices from different websites. He is only interested in prices and titles, so he uses AI to find the price for him.

@@ -94,7 +94,7 @@ It will be a bit boring, but it's definitely less expensive than AI.

This example illustrates the point I aim to convey here. Not every challenge will need AI to be solved, but sometimes you need to be creative, and that might save you a lot of money.

### Solving issue B3: Pagination variations
- This issue, Scrapling currently doesn't have a direct method to automatically extract pagination's URLs for you, but it will be added with the
+ This issue, Scrapling currently doesn't have a direct method to automatically extract pagination's URLs for you, but it will be added with the upcoming updates :)

But you can handle most websites if you search for the most common patterns with `page.find_by_text('Next')['href']` or `page.find_by_text('load more')['href']` or selectors like `'a[href*="?page="]'` or `'a[href*="/page/"]'`—you get the idea.

@@ -112,6 +112,6 @@ For a quick comparison.

This table is based on pricing from [Browse AI Pricing](https://www.browse.ai/pricing) and [Oxylabs Web Scraper API Pricing](https://oxylabs.io/products/scraper-api/web/pricing)

## Conclusion
- While AI offers powerful capabilities, its cost can be prohibitive for many Web scraping tasks. Scrapling provides a robust, flexible, and cost-effective toolkit
+ While AI offers powerful capabilities, its cost can be prohibitive for many Web scraping tasks. Scrapling provides a robust, flexible, and cost-effective toolkit for tackling the real-world challenges of both targeted and broad scraping, often eliminating the need for expensive AI solutions. You can build resilient scrapers more efficiently by leveraging features like `adaptive`, diverse selection methods, and advanced fetchers.

Explore the documentation further and see how Scrapling can simplify your future Web Scraping projects!
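The "highest similarity score" idea behind the adaptive feature described above can be illustrated with a toy sketch. This is not Scrapling's implementation — just the matching intuition: snapshot an element's unique properties, then after a redesign pick the candidate element whose properties overlap the most with the snapshot.

```python
# Toy illustration of similarity-based re-finding (NOT Scrapling's actual code).
def similarity(saved: dict, candidate: dict) -> float:
    """Fraction of properties that two element snapshots share."""
    keys = set(saved) | set(candidate)
    shared = sum(1 for k in keys if saved.get(k) == candidate.get(k))
    return shared / len(keys)

# Snapshot taken while the old structure still worked.
saved = {"tag": "span", "class": "price", "text": "$19.99", "parent": "div.card"}

# After the redesign: the class name changed, but most properties survived.
candidates = [
    {"tag": "span", "class": "amount", "text": "$19.99", "parent": "div.card"},
    {"tag": "a", "class": "nav", "text": "Home", "parent": "header"},
]

# Pick the candidate with the highest similarity to the saved snapshot.
best = max(candidates, key=lambda c: similarity(saved, c))
```

Here the first candidate wins with a 0.75 score (3 of 4 properties match), which is the behavior the article attributes to `adaptive`: nothing single is relied on, so any one property may change without losing the element.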
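The common-pattern search suggested under issue B3 can also be sketched with the standard library alone — `PaginationFinder` here is a hypothetical helper for illustration, not part of Scrapling, which does this more robustly via `page.find_by_text` and CSS selectors.

```python
# Sketch of the "common pagination patterns" idea: scan anchors for link text
# like "Next"/"load more" or hrefs containing "?page=" / "/page/".
from html.parser import HTMLParser

class PaginationFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self._href = None   # href of the <a> tag we are currently inside
        self.matches = []   # candidate pagination URLs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href", "") or ""
            # Pattern 1: the URL itself looks like pagination.
            if "?page=" in self._href or "/page/" in self._href:
                self.matches.append(self._href)

    def handle_data(self, data):
        # Pattern 2: the link text looks like pagination.
        if self._href and data.strip().lower() in ("next", "load more"):
            if self._href not in self.matches:
                self.matches.append(self._href)

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

html = '<a href="/blog?page=2">Next</a> <a href="/about">About</a>'
finder = PaginationFinder()
finder.feed(html)
# finder.matches → ["/blog?page=2"]
```

Both heuristics agree on the first anchor and ignore the second, matching the article's point that a handful of cheap patterns covers most sites.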