The Content Extract API is an advanced tool designed to conveniently extract clean, structured text content from web pages. It is aimed at users who need to retrieve and analyze text data from the web efficiently and accurately. Through a set of dedicated endpoints, the API converts web page content into clean text and markdown formats, suiting a wide range of data processing and analysis needs.
Main Features
Clean text extraction: The first endpoint focuses on delivering the clean text content of a web page. It removes unnecessary elements such as ads, menus, and sidebars, keeping only relevant, meaningful text. Clean text extraction is ideal for applications that need clear, unformatted content for analysis or display, such as automatic summarization, search engines, or content analysis tools.
Markdown conversion: The second endpoint converts web page content into markdown format. Markdown is a lightweight markup language that structures text in a simple way, making it easy to integrate into applications that use this format for documentation, blog posts, or content management.
Support for different page types: The Content Extract API is designed to handle all kinds of web pages, from static sites to dynamic pages generated by JavaScript. This ensures users can extract content from virtually any page, regardless of its complexity or structure.
In short, the Content Extract API offers an advanced solution for extracting and converting text content from web pages. With its dedicated clean text and markdown endpoints, it gives users effective tools to retrieve and manage web data in useful formats for a wide range of applications and needs. Its flexibility and ease of integration make it a valuable choice for any task that involves processing and analyzing web content.
This API takes a web page URL and returns either clean text or markdown content extracted from that page.
Blog content generation: Convert web page content to markdown for easy integration into blogging platforms or content management systems, streamlining publishing and editing.
Data collection for market research: Extract clean text from a variety of web pages to gather data on market trends, consumer behavior, or competitive analysis.
Automated news summarization: Use the text extractor to create automatic news summaries by stripping irrelevant elements and focusing on the main content.
Technical documentation creation: Convert web page content to markdown to build technical documentation or user guides that plug into collaborative documentation systems.
Data extraction for SEO tools: Extract clean text from web pages to analyze content and refine SEO strategy by identifying relevant keywords and topics.
There are no limitations other than the number of API calls allowed per month.
To use this endpoint, send a request containing the URL of a web page and receive the clean text extracted from that page's content.
Extract Info - Endpoint Features
| Object | Description |
|---|---|
| Request Body | [Required] Json |
{"response":"Spark Basics\nSuppose we have a web application hosted in an application orchestrator like kubernetes. If load in that particular application increases then we can horizontally scale our application simply by increasing the number of pods in our service.\nNow let’s suppose there is heavy compute operation happening in each of the pods. Then there will be certain limit upto which these services can run because unlike horizontal scaling where you can have as many numbers of machines as required, there is limit for vertical scaling because you can’t have unlimited ram and cpu cores for each of the machines in a cluster. Distributed Computing removes this limitation of vertical scaling by distributing the processing across cluster of machines. Now, a group of machines alone is not powerful, you need a framework to coordinate work across them. Spark does just that, managing and coordinating the execution of tasks on data across a cluster of computers. The cluster of machines that Spark will use to execute tasks is managed by a cluster manager like Spark’s standalone cluster manager, Kubernetes, YARN, or Mesos.\nSpark Basics\nSpark is distributed data processing engine. Distributed data processing in big data is simply series of map and reduce functions which runs across the cluster machines. Given below is python code for calculating the sum of all the even numbers from a given list with the help of map and reduce functions.\nfrom functools import reduce\na = [1,2,3,4,5]\nres = reduce(lambda x,y: x+y, (map(lambda x: x if x%2==0 else 0, a)))\nNow consider, if instead of a simple list, it is a parquet file of size in order of gigabytes. Computation with MapReduce system becomes optimized way of dealing with such problems. In this case spark will load the big parquet file into multiple worker nodes (if the file doesn’t support distributed storage then it will be first loaded into driver node and afterwards, it will get distributed across the worker nodes). Then map function will be executed for each task in each worker node and the final result will fetched with the reduce function.\nSpark timeline\nGoogle was first to introduce large scale distributed computing solution with MapReduce and its own distributed file system i.e., Google File System(GFS). GFS provided a blueprint for the Hadoop File System (HDFS), including the MapReduce implementation as a framework for distributed computing. Apache Hadoop framework was developed consisting of Hadoop Common, MapReduce, HDFS, and Apache Hadoop YARN. There were various limitations with Apache Hadoop like it fell short for combining other workloads such as machine learning, streaming, or interactive SQL-like queries etc. Also the results of the reduce computations were written to a local disk for subsequent stage of operations. Then came the Spark. Spark provides in-memory storage for intermediate computations, making it much faster than Hadoop MapReduce. It incorporates libraries with composable APIs for machine learning (MLlib), SQL for interactive queries (Spark SQL), stream processing (Structured Streaming) for interacting with real-time data, and graph processing (GraphX).\nSpark Application\nSpark Applications consist of a driver process and a set of executor processes. The driver process runs your main() function, sits on a node in the cluster. The executors are responsible for actually carrying out the work that the driver assigns them. 
The driver and executors are simply processes, which means that they can live on the same machine or different machines.\nThere is a SparkSession object available to the user, which is the entrance point to running Spark code. When using Spark from Python or R, you don’t write explicit JVM instructions; instead, you write Python and R code that Spark translates into code that it then can run on the executor JVMs.\nSpark’s language APIs make it possible for you to run Spark code using various programming languages like Scala, Java, Python, SQL and R.\nSpark has two fundamental sets of APIs: the low-level “unstructured” APIs (RDDs), and the higher-level structured APIs (Dataframes, Datasets).\nSpark Toolsets\nA DataFrame is the most common Structured API and simply represents a table of data with rows and columns. To allow every executor to perform work in parallel, Spark breaks up the data into chunks called partitions. A partition is a collection of rows that sit on one physical machine in your cluster.\nIf a function returns a Dataframe or Dataset or Resilient Distributed Dataset (RDD) then it is a transformation and if it doesn’t return anything then it’s an action. An action instructs Spark to compute a result from a series of transformations. The simplest action is count.\nTransformation are of types narrow and wide. Narrow transformations are those for which each input partition will contribute to only one output partition. Wide transformation will have input partitions contributing to many output partitions.\nSparks performs a lazy evaluation which means that Spark will wait until the very last moment to execute the graph of computation instructions. This provides immense benefits because Spark can optimize the entire data flow from end to end.\nSpark-submit\nReferences\n- https://spark.apache.org/docs/latest/\n- spark: The Definitive Guide by Bill Chambers and Matei Zaharia"}
curl --location --request POST 'https://zylalabs.com/api/5081/content+extract+api/6473/extract+info' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--data-raw '{
    "url": "https://techtalkverse.com/post/software-development/spark-basics/"
}'
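For integration beyond the shell, here is a minimal Python sketch of the same call. It assumes the widely used `requests` library and the `response` key shown in the sample output above; the preview length is arbitrary.

```python
import requests

API_KEY = "YOUR_API_KEY"  # your Zyla access key
ENDPOINT = "https://zylalabs.com/api/5081/content+extract+api/6473/extract+info"

# POST the target page URL with Bearer authentication, mirroring the curl call above.
resp = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"url": "https://techtalkverse.com/post/software-development/spark-basics/"},
)
resp.raise_for_status()

# The sample response stores the clean text under the "response" key.
clean_text = resp.json()["response"]
print(clean_text[:300])  # preview the first 300 characters
```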
To use this endpoint, send a request containing the URL of a web page and receive that page's content converted to markdown format.
Export Markdown - Endpoint Features
| Object | Description |
|---|---|
| Request Body | [Required] Json |
{"response":"---\ntitle: Spark Basics\nurl: https://techtalkverse.com/post/software-development/spark-basics/\nhostname: techtalkverse.com\ndescription: Suppose we have a web application hosted in an application orchestrator like kubernetes. If load in that particular application increases then we can horizontally scale our application simply by increasing the number of pods in our service.\nsitename: techtalkverse.com\ndate: 2023-05-01\ncategories: ['post']\n---\n# Spark Basics\n\nSuppose we have a web application hosted in an application orchestrator like kubernetes. If load in that particular application increases then we can horizontally scale our application simply by increasing the number of pods in our service.\n\nNow let’s suppose there is heavy compute operation happening in each of the pods. Then there will be certain limit upto which these services can run because unlike horizontal scaling where you can have as many numbers of machines as required, there is limit for vertical scaling because you can’t have unlimited ram and cpu cores for each of the machines in a cluster. **Distributed Computing** removes this limitation of vertical scaling by distributing the processing across cluster of machines.\nNow, a group of machines alone is not powerful, you need a framework to\ncoordinate work across them. Spark does just that, managing and coordinating the execution of tasks on data across a cluster of computers. The cluster of machines that Spark will use to execute tasks is managed by a cluster manager like Spark’s standalone cluster manager, Kubernetes, YARN, or Mesos.\n\n## Spark Basics\n\nSpark is distributed data processing engine. Distributed data processing in big data is simply series of map and reduce functions which runs across the cluster machines. Given below is python code for calculating the sum of all the even numbers from a given list with the help of map and reduce functions.\n\n```\nfrom functools import reduce\na = [1,2,3,4,5]\nres = reduce(lambda x,y: x+y, (map(lambda x: x if x%2==0 else 0, a)))\n```\n\n\nNow consider, if instead of a simple list, it is a parquet file of size in order of gigabytes. Computation with MapReduce system becomes optimized way of dealing with such problems. In this case spark will load the big parquet file into multiple worker nodes (if the file doesn’t support distributed storage then it will be first loaded into driver node and afterwards, it will get distributed across the worker nodes). Then map function will be executed for each task in each worker node and the final result will fetched with the reduce function.\n\n## Spark timeline\n\nGoogle was first to introduce large scale distributed computing solution with **MapReduce** and its own distributed file system i.e., **Google File System(GFS)**. GFS provided a blueprint for the **Hadoop File System (HDFS)**, including the MapReduce implementation as a framework for distributed computing. **Apache Hadoop** framework was developed consisting of Hadoop Common, MapReduce, HDFS, and Apache Hadoop YARN. There were various limitations with Apache Hadoop like it fell short for combining other workloads such as machine learning, streaming, or interactive SQL-like queries etc. Also the results of the reduce computations were written to a local disk for subsequent stage of operations. Then came the **Spark**. Spark provides in-memory storage for intermediate computations, making it much faster than Hadoop MapReduce. 
It incorporates libraries with composable APIs for\nmachine learning (MLlib), SQL for interactive queries (Spark SQL), stream processing (Structured Streaming) for interacting with real-time data, and graph processing (GraphX).\n\n## Spark Application\n\n**Spark Applications** consist of a driver process and a set of executor processes. The **driver** process runs your main() function, sits on a node in the cluster. The **executors** are responsible for actually carrying out the work that the driver assigns them. The driver and executors are simply processes, which means that they can live on the same machine or different machines.\n\nThere is a **SparkSession** object available to the user, which is the entrance point to running Spark code. When using Spark from Python or R, you don’t write explicit JVM instructions; instead, you write Python and R code that Spark translates into code that it then can run on the executor JVMs.\n**Spark’s language APIs** make it possible for you to run Spark code using various programming languages like Scala, Java, Python, SQL and R.\nSpark has two fundamental sets of APIs: the **low-level “unstructured” APIs** (RDDs), and the **higher-level structured APIs** (Dataframes, Datasets).\n\n## Spark Toolsets\n\nA **DataFrame** is the most common Structured API and simply represents a table of data with rows and columns. To allow every executor to perform work in parallel, Spark breaks up the data into chunks called partitions. A **partition** is a collection of rows that sit on one physical machine in your cluster.\n\nIf a function returns a Dataframe or Dataset or Resilient Distributed Dataset (RDD) then it is a **transformation** and if it doesn’t return anything then it’s an **action**. An action instructs Spark to compute a result from a series of transformations. The simplest action is count.\n\nTransformation are of types narrow and wide. **Narrow transformations** are those for which each input partition will contribute to only one output partition. **Wide transformation** will have input partitions contributing to many output partitions.\n\nSparks performs a **lazy evaluation** which means that Spark will wait until the very last moment to execute the graph of computation instructions. This provides immense benefits because Spark can optimize the entire data flow from end to end.\n\n## Spark-submit\n\n## References\n\n- https://spark.apache.org/docs/latest/\n- spark: The Definitive Guide by Bill Chambers and Matei Zaharia"}
curl --location --request POST 'https://zylalabs.com/api/5081/content+extract+api/6474/exc+marktdown' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--data-raw '{
    "url": "https://techtalkverse.com/post/software-development/spark-basics/"
}'
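To sketch the blog workflow from the use cases above: the returned markdown (front matter included, as in the sample response) can be written straight to a `.md` file for a static site generator or CMS to pick up. The file name is illustrative, and the `requests` library and `response` key are assumed as before.

```python
import requests

API_KEY = "YOUR_API_KEY"
ENDPOINT = "https://zylalabs.com/api/5081/content+extract+api/6474/exc+marktdown"

resp = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"url": "https://techtalkverse.com/post/software-development/spark-basics/"},
)
resp.raise_for_status()

# Write the markdown to disk so a blog platform or static site
# generator can publish it directly.
with open("spark-basics.md", "w", encoding="utf-8") as f:
    f.write(resp.json()["response"])
```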
| Header | Description |
|---|---|
| Authorization | [Required] Should be Bearer access_key. After subscribing, see "Your API Access Key" above. |
No long-term commitment. Upgrade, downgrade, or cancel at any time. The free trial includes up to 50 requests.
To use this API, send a web page URL to the corresponding endpoint and receive the extracted content in clean text or markdown format.
The Content Extract API extracts web page content and converts it to clean text or Markdown, facilitating web data analysis and integration.
There are different plans to suit everyone, including a free trial with a small number of requests, though it is rate limited to prevent abuse of the service.
Zyla offers a variety of integration methods for almost all programming languages. You can use these code samples to integrate the API into your project as needed.
The "Extract Info" endpoint returns clean text extracted from a web page, while the "Export Markdown" endpoint delivers the same content in Markdown format. Both endpoints focus on providing structured, readable content for analysis or integration.
The response data typically includes the extracted content as a single block of text for the "Extract Info" endpoint, and as a Markdown-formatted string for the "Export Markdown" endpoint. Additional metadata may be included depending on the implementation.
The response data is structured as a JSON object containing the extracted content as a key-value pair. As the sample responses above show, the key is "response" and its value is the clean text or markdown.
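A defensive way to unpack that JSON object, sketched under the assumption (borne out by the samples above) that the content lives under the "response" key:

```python
import requests

def extract_content(api_response: requests.Response) -> str:
    """Return the extracted content from a Content Extract API response,
    failing loudly if the payload does not match the documented shape."""
    api_response.raise_for_status()      # surface HTTP-level errors first
    payload = api_response.json()        # parse the JSON body
    if "response" not in payload:        # key observed in the sample responses
        raise KeyError(f"unexpected payload keys: {list(payload)}")
    return payload["response"]
```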
The main parameter for both endpoints is the "url" of the web page to extract content from. Users can tailor their requests to target specific pages by supplying different URLs.
Each endpoint delivers the textual content of a web page, focusing on the main text sections while filtering out ads, menus, and other non-essential elements. This ensures users get information relevant to their needs.
Users can integrate the returned clean text or Markdown into applications for content generation, analysis, or documentation. For example, the Markdown can be used directly in a blogging platform, while the clean text can feed analytical insights.
Common use cases include blog content generation, market research data collection, automated news summarization, technical documentation creation, and SEO analysis. Each leverages the API's ability to extract and format web content.
The API uses algorithms to filter out irrelevant content, ensuring the extracted text is clean and meaningful. Ongoing updates and improvements to the extraction process help maintain high data quality and relevance.