The Structured Text Extractor API offers a simple solution for extracting and converting the content of web pages. Its clean text endpoint delivers readable, unformatted content, ideal for text analysis or simplified presentation. The markdown endpoint goes a step further, generating structured markdown that is well suited to integration with markdown-compatible tools and systems. With support for a wide range of web page types, the API ensures reliable performance and adaptability across applications, making it a practical tool for content analysis and transformation.
To use this endpoint, send a request with the URL of a web page and receive the clean text extracted from that page's content.
Markdownify - Endpoint Features
| Object | Description |
|---|---|
| Request body | [Required] JSON |
{"response":"Spark Basics\nSuppose we have a web application hosted in an application orchestrator like kubernetes. If load in that particular application increases then we can horizontally scale our application simply by increasing the number of pods in our service.\nNow let’s suppose there is heavy compute operation happening in each of the pods. Then there will be certain limit upto which these services can run because unlike horizontal scaling where you can have as many numbers of machines as required, there is limit for vertical scaling because you can’t have unlimited ram and cpu cores for each of the machines in a cluster. Distributed Computing removes this limitation of vertical scaling by distributing the processing across cluster of machines. Now, a group of machines alone is not powerful, you need a framework to coordinate work across them. Spark does just that, managing and coordinating the execution of tasks on data across a cluster of computers. The cluster of machines that Spark will use to execute tasks is managed by a cluster manager like Spark’s standalone cluster manager, Kubernetes, YARN, or Mesos.\nSpark Basics\nSpark is distributed data processing engine. Distributed data processing in big data is simply series of map and reduce functions which runs across the cluster machines. Given below is python code for calculating the sum of all the even numbers from a given list with the help of map and reduce functions.\nfrom functools import reduce\na = [1,2,3,4,5]\nres = reduce(lambda x,y: x+y, (map(lambda x: x if x%2==0 else 0, a)))\nNow consider, if instead of a simple list, it is a parquet file of size in order of gigabytes. Computation with MapReduce system becomes optimized way of dealing with such problems. In this case spark will load the big parquet file into multiple worker nodes (if the file doesn’t support distributed storage then it will be first loaded into driver node and afterwards, it will get distributed across the worker nodes). Then map function will be executed for each task in each worker node and the final result will fetched with the reduce function.\nSpark timeline\nGoogle was first to introduce large scale distributed computing solution with MapReduce and its own distributed file system i.e., Google File System(GFS). GFS provided a blueprint for the Hadoop File System (HDFS), including the MapReduce implementation as a framework for distributed computing. Apache Hadoop framework was developed consisting of Hadoop Common, MapReduce, HDFS, and Apache Hadoop YARN. There were various limitations with Apache Hadoop like it fell short for combining other workloads such as machine learning, streaming, or interactive SQL-like queries etc. Also the results of the reduce computations were written to a local disk for subsequent stage of operations. Then came the Spark. Spark provides in-memory storage for intermediate computations, making it much faster than Hadoop MapReduce. It incorporates libraries with composable APIs for machine learning (MLlib), SQL for interactive queries (Spark SQL), stream processing (Structured Streaming) for interacting with real-time data, and graph processing (GraphX).\nSpark Application\nSpark Applications consist of a driver process and a set of executor processes. The driver process runs your main() function, sits on a node in the cluster. The executors are responsible for actually carrying out the work that the driver assigns them. 
The driver and executors are simply processes, which means that they can live on the same machine or different machines.\nThere is a SparkSession object available to the user, which is the entrance point to running Spark code. When using Spark from Python or R, you don’t write explicit JVM instructions; instead, you write Python and R code that Spark translates into code that it then can run on the executor JVMs.\nSpark’s language APIs make it possible for you to run Spark code using various programming languages like Scala, Java, Python, SQL and R.\nSpark has two fundamental sets of APIs: the low-level “unstructured” APIs (RDDs), and the higher-level structured APIs (Dataframes, Datasets).\nSpark Toolsets\nA DataFrame is the most common Structured API and simply represents a table of data with rows and columns. To allow every executor to perform work in parallel, Spark breaks up the data into chunks called partitions. A partition is a collection of rows that sit on one physical machine in your cluster.\nIf a function returns a Dataframe or Dataset or Resilient Distributed Dataset (RDD) then it is a transformation and if it doesn’t return anything then it’s an action. An action instructs Spark to compute a result from a series of transformations. The simplest action is count.\nTransformation are of types narrow and wide. Narrow transformations are those for which each input partition will contribute to only one output partition. Wide transformation will have input partitions contributing to many output partitions.\nSparks performs a lazy evaluation which means that Spark will wait until the very last moment to execute the graph of computation instructions. This provides immense benefits because Spark can optimize the entire data flow from end to end.\nSpark-submit\nReferences\n- https://spark.apache.org/docs/latest/\n- spark: The Definitive Guide by Bill Chambers and Matei Zaharia"}
curl --location --request POST 'https://zylalabs.com/api/5662/structured+text+extractor+api/7373/markdownify+api' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--data-raw '{
    "url": "https://techtalkverse.com/post/software-development/spark-basics/"
}'
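The same call can be made from Python. Below is a minimal sketch based on the curl example above; the endpoint path, header, and JSON body are taken from this page, YOUR_API_KEY is a placeholder, and the assumption that the extracted text arrives in a top-level "response" field follows the sample response shown above.

```python
import requests

# Minimal sketch of the Markdownify call from the curl example above.
# YOUR_API_KEY is a placeholder for your actual access key.
API_URL = "https://zylalabs.com/api/5662/structured+text+extractor+api/7373/markdownify+api"
headers = {"Authorization": "Bearer YOUR_API_KEY"}
payload = {"url": "https://techtalkverse.com/post/software-development/spark-basics/"}

resp = requests.post(API_URL, headers=headers, json=payload, timeout=30)
resp.raise_for_status()

# The sample response on this page wraps the extracted text in a "response" field.
text = resp.json()["response"]
print(text[:300])  # preview the first 300 characters
```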
To use this endpoint, send a request with the URL of a web page and receive that page's content converted to markdown format.
Clean Page Converter - Endpoint Features
| Object | Description |
|---|---|
| Request body | [Required] JSON |
{"response":"---\ntitle: Spark Basics\nurl: https://techtalkverse.com/post/software-development/spark-basics/\nhostname: techtalkverse.com\ndescription: Suppose we have a web application hosted in an application orchestrator like kubernetes. If load in that particular application increases then we can horizontally scale our application simply by increasing the number of pods in our service.\nsitename: techtalkverse.com\ndate: 2023-05-01\ncategories: ['post']\n---\n# Spark Basics\n\nSuppose we have a web application hosted in an application orchestrator like kubernetes. If load in that particular application increases then we can horizontally scale our application simply by increasing the number of pods in our service.\n\nNow let’s suppose there is heavy compute operation happening in each of the pods. Then there will be certain limit upto which these services can run because unlike horizontal scaling where you can have as many numbers of machines as required, there is limit for vertical scaling because you can’t have unlimited ram and cpu cores for each of the machines in a cluster. **Distributed Computing** removes this limitation of vertical scaling by distributing the processing across cluster of machines.\nNow, a group of machines alone is not powerful, you need a framework to\ncoordinate work across them. Spark does just that, managing and coordinating the execution of tasks on data across a cluster of computers. The cluster of machines that Spark will use to execute tasks is managed by a cluster manager like Spark’s standalone cluster manager, Kubernetes, YARN, or Mesos.\n\n## Spark Basics\n\nSpark is distributed data processing engine. Distributed data processing in big data is simply series of map and reduce functions which runs across the cluster machines. Given below is python code for calculating the sum of all the even numbers from a given list with the help of map and reduce functions.\n\n```\nfrom functools import reduce\na = [1,2,3,4,5]\nres = reduce(lambda x,y: x+y, (map(lambda x: x if x%2==0 else 0, a)))\n```\n\n\nNow consider, if instead of a simple list, it is a parquet file of size in order of gigabytes. Computation with MapReduce system becomes optimized way of dealing with such problems. In this case spark will load the big parquet file into multiple worker nodes (if the file doesn’t support distributed storage then it will be first loaded into driver node and afterwards, it will get distributed across the worker nodes). Then map function will be executed for each task in each worker node and the final result will fetched with the reduce function.\n\n## Spark timeline\n\nGoogle was first to introduce large scale distributed computing solution with **MapReduce** and its own distributed file system i.e., **Google File System(GFS)**. GFS provided a blueprint for the **Hadoop File System (HDFS)**, including the MapReduce implementation as a framework for distributed computing. **Apache Hadoop** framework was developed consisting of Hadoop Common, MapReduce, HDFS, and Apache Hadoop YARN. There were various limitations with Apache Hadoop like it fell short for combining other workloads such as machine learning, streaming, or interactive SQL-like queries etc. Also the results of the reduce computations were written to a local disk for subsequent stage of operations. Then came the **Spark**. Spark provides in-memory storage for intermediate computations, making it much faster than Hadoop MapReduce. 
It incorporates libraries with composable APIs for\nmachine learning (MLlib), SQL for interactive queries (Spark SQL), stream processing (Structured Streaming) for interacting with real-time data, and graph processing (GraphX).\n\n## Spark Application\n\n**Spark Applications** consist of a driver process and a set of executor processes. The **driver** process runs your main() function, sits on a node in the cluster. The **executors** are responsible for actually carrying out the work that the driver assigns them. The driver and executors are simply processes, which means that they can live on the same machine or different machines.\n\nThere is a **SparkSession** object available to the user, which is the entrance point to running Spark code. When using Spark from Python or R, you don’t write explicit JVM instructions; instead, you write Python and R code that Spark translates into code that it then can run on the executor JVMs.\n**Spark’s language APIs** make it possible for you to run Spark code using various programming languages like Scala, Java, Python, SQL and R.\nSpark has two fundamental sets of APIs: the **low-level “unstructured” APIs** (RDDs), and the **higher-level structured APIs** (Dataframes, Datasets).\n\n## Spark Toolsets\n\nA **DataFrame** is the most common Structured API and simply represents a table of data with rows and columns. To allow every executor to perform work in parallel, Spark breaks up the data into chunks called partitions. A **partition** is a collection of rows that sit on one physical machine in your cluster.\n\nIf a function returns a Dataframe or Dataset or Resilient Distributed Dataset (RDD) then it is a **transformation** and if it doesn’t return anything then it’s an **action**. An action instructs Spark to compute a result from a series of transformations. The simplest action is count.\n\nTransformation are of types narrow and wide. **Narrow transformations** are those for which each input partition will contribute to only one output partition. **Wide transformation** will have input partitions contributing to many output partitions.\n\nSparks performs a **lazy evaluation** which means that Spark will wait until the very last moment to execute the graph of computation instructions. This provides immense benefits because Spark can optimize the entire data flow from end to end.\n\n## Spark-submit\n\n## References\n\n- https://spark.apache.org/docs/latest/\n- spark: The Definitive Guide by Bill Chambers and Matei Zaharia"}
curl --location --request POST 'https://zylalabs.com/api/5662/structured+text+extractor+api/7374/clean+page+converter' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--data-raw '{
    "url": "https://techtalkverse.com/post/software-development/spark-basics/"
}'
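As a rough sketch, the same request in Python looks like the snippet below. It assumes, as in the sample response above, that the converted markdown is returned in a "response" field; the output file name is purely illustrative.

```python
import requests

# Sketch of the Clean Page Converter call from the curl example above.
# YOUR_API_KEY is a placeholder for your actual access key.
API_URL = "https://zylalabs.com/api/5662/structured+text+extractor+api/7374/clean+page+converter"
headers = {"Authorization": "Bearer YOUR_API_KEY"}
payload = {"url": "https://techtalkverse.com/post/software-development/spark-basics/"}

resp = requests.post(API_URL, headers=headers, json=payload, timeout=30)
resp.raise_for_status()

# Per the sample response above, the converted page is returned in a "response" field.
markdown = resp.json()["response"]

# Save the markdown so it can be fed to any markdown-compatible tool
# (the file name is illustrative, not part of the API).
with open("spark-basics.md", "w", encoding="utf-8") as fh:
    fh.write(markdown)
```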
| Header | Description |
|---|---|
| Authorization | [Required] Must be Bearer access_key. See "Your API Access Key" above once you are subscribed. |
No long-term commitment. Upgrade, downgrade, or cancel at any time. The free trial includes up to 50 requests.
The Structured Text Extractor API is a tool designed to extract clean text or markdown from any web page, simplifying content handling for a wide range of applications.
The API can extract unformatted, readable content as well as structured markdown, making it suitable for text analysis and for integration with markdown-compatible tools.
The clean text endpoint delivers unformatted content from a web page, ensuring the extracted text is readable and suitable for analysis or presentation without any HTML tags.
The markdown endpoint generates structured markdown, ideal for users who need to integrate extracted content with markdown-compatible systems, improving usability and formatting.
Yes, the Structured Text Extractor API supports a wide variety of web page types, ensuring reliable performance and adaptability for diverse content analysis and transformation needs.
The clean text endpoint returns readable, unformatted text extracted from a web page, while the markdown endpoint returns structured markdown that includes metadata such as title, URL, and description along with the formatted content.
The clean text response contains the extracted text, while the markdown response includes fields such as title, URL, hostname, description, date, and categories, plus the main content formatted as markdown.
The clean text response is a simple text string, while the markdown response is structured as a JSON object with key-value pairs, allowing easy access to specific metadata and content.
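As an illustration, the metadata block in the sample markdown response can be read with a few lines of Python. This is a sketch that assumes the simple "key: value" front matter layout shown in the sample response above; parse_front_matter is a hypothetical helper, not part of the API.

```python
# Hypothetical helper: pull metadata fields out of the front matter block
# of the markdown response (the part between the two "---" lines).
def parse_front_matter(markdown: str) -> dict:
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    return meta

# Abbreviated sample, matching the structure of the response shown above.
sample = """---
title: Spark Basics
url: https://techtalkverse.com/post/software-development/spark-basics/
date: 2023-05-01
---
# Spark Basics
"""

print(parse_front_matter(sample))
# {'title': 'Spark Basics', 'url': 'https://techtalkverse.com/post/software-development/spark-basics/', 'date': '2023-05-01'}
```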
The clean text endpoint provides plain text for analysis, while the markdown endpoint offers richer content with metadata, making it well suited to documentation, blogs, and content management systems.
The main parameter for both endpoints is the URL of the web page from which content should be extracted. Users can customize requests by supplying different URLs to target specific content.
Users can analyze the clean text for insights, or use the markdown output to create formatted documents, integrate with content management systems, or enhance web applications that support markdown.
The API employs robust parsing algorithms to extract content accurately from web pages, ensuring the returned text and markdown reflect the original content as closely as possible.
The most common use cases include content aggregation, text analysis, blog post creation, and transforming web content into markdown for documentation or publishing on markdown-compatible platforms.