Best practice for Erlang's process design to build a website-downloader (super simple crawler)

6 messages
Best practice for Erlang's process design to build a website-downloader (super simple crawler)

I Gusti Ngurah Oka Prinarjaya
Hi,

I need to know the best practice for Erlang process design to build a website downloader. I don't need heavy parsing of the website like a scraper does; I probably only need to parse the URLs in `<a href="...">` tags.

What came to my mind was to create N Erlang processes under a supervisor, where N is the number of `<a href="...">` URLs found in a website's pages. But I'm not sure that's a good design, so I'd like recommendations from those of you who have experience with it.

Thank you, I appreciate your time and attention.




Re: Best practice for Erlang's process design to build a website-downloader (super simple crawler)

I Gusti Ngurah Oka Prinarjaya
Hi,

Anyone? 






Re: Best practice for Erlang's process design to build a website-downloader (super simple crawler)

Grzegorz Junka

Hi Gusti,

I would suggest creating a pool of N processes and a queue of URLs to process. Every time a new URL is encountered, it is added to the queue. A scheduler then picks up those URLs and distributes them across the pool of processes. I would not suggest creating a new process for each URL unless you can be sure it doesn't lead to an explosion of processes, i.e. that the number of URLs is bounded.

Greg
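A minimal sketch of this idea, assuming plain processes rather than OTP behaviours (module, function and message names are mine, not from this thread): one queue process that deduplicates URLs and hands them out, plus a fixed pool of N workers that ask for work, fetch, and push any newly found URLs back.

```erlang
%% Sketch: fixed pool of N workers pulling URLs from one queue process.
%% Fetch is a fun(Url) -> [Url] that downloads a page and returns the
%% URLs found on it (hypothetical; supplied by the caller).
-module(url_pool).
-export([start/2, push/2]).

start(N, Fetch) ->
    Q = spawn(fun() -> queue_loop(queue:new(), sets:new(), []) end),
    [spawn(fun() -> worker_loop(Q, Fetch) end) || _ <- lists:seq(1, N)],
    Q.

push(Q, Url) -> Q ! {push, Url}, ok.

%% State: pending URL queue, set of URLs ever seen, idle worker pids.
queue_loop(Queue, Seen, Waiting) ->
    receive
        {push, Url} ->
            case sets:is_element(Url, Seen) of
                true ->                       % already seen: drop duplicate
                    queue_loop(Queue, Seen, Waiting);
                false ->
                    Seen1 = sets:add_element(Url, Seen),
                    case Waiting of
                        [W | Rest] ->         % an idle worker takes it at once
                            W ! {task, Url},
                            queue_loop(Queue, Seen1, Rest);
                        [] ->                 % everyone busy: enqueue
                            queue_loop(queue:in(Url, Queue), Seen1, [])
                    end
            end;
        {ready, W} ->
            case queue:out(Queue) of
                {{value, Url}, Q1} -> W ! {task, Url}, queue_loop(Q1, Seen, Waiting);
                {empty, Q1}        -> queue_loop(Q1, Seen, [W | Waiting])
            end
    end.

worker_loop(Q, Fetch) ->
    Q ! {ready, self()},                      % ask the queue for work
    receive
        {task, Url} ->
            [push(Q, U) || U <- Fetch(Url)],  % feed found URLs back
            worker_loop(Q, Fetch)
    end.
```

The pool size N, not the number of URLs, bounds concurrency; duplicates are dropped at the queue, so the crawl terminates once no new URLs appear.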




Re: Best practice for Erlang's process design to build a website-downloader (super simple crawler)

I Gusti Ngurah Oka Prinarjaya
Hi Grzegorz,

Thank you for your suggestion. 

>> I would not suggest creating a new process for each URL unless you can be sure it doesn't lead to an explosion of processes
Thanks for reminding me







RE: Best practice for Erlang's process design to build a website-downloader (super simple crawler)

Сергей Прохоров-2
In reply to this post by I Gusti Ngurah Oka Prinarjaya
Hi,

I have quite a few years of experience writing web scrapers in Erlang. The design that I converged on over time is the following:

I have a top-level supervisor with the following 3 processes:

- workers_supervisor (queue_name)
- queue (queue_name)
- scheduler (queue_name)

All of them are registered based on the queue name, to make it possible to run multiple sets of spiders at the same time; that's not essential if you only need one.
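Wired under a top-level supervisor, this could look roughly like the sketch below. The module names (`spider_queue`, `spider_scheduler`, `workers_sup`), the restart strategy, and the intensity/period values are my illustrative assumptions, not something specified in the post.

```erlang
%% Sketch of the top-level supervisor for the three processes above.
%% Child modules are hypothetical placeholders.
-module(spider_sup).
-behaviour(supervisor).
-export([start_link/1, init/1]).

start_link(QueueName) ->
    supervisor:start_link(?MODULE, QueueName).

init(QueueName) ->
    %% one_for_all (an assumption): if the queue dies, the workers and
    %% scheduler lose the state they depend on, so restart all together.
    SupFlags = #{strategy => one_for_all, intensity => 3, period => 10},
    Children =
        [#{id => queue,                       % started first: others subscribe to it
           start => {spider_queue, start_link, [QueueName]}},
         #{id => workers_sup,                 % simple_one_for_one for workers
           start => {workers_sup, start_link, [QueueName]},
           type => supervisor},
         #{id => scheduler,
           start => {spider_scheduler, start_link, [QueueName]}}],
    {ok, {SupFlags, Children}}.
```

Starting the queue before the workers supervisor and scheduler matters, since both of the latter talk to the queue as soon as they come up.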
  • workers supervisor is just a `simple_one_for_one` supervisor that supervises the worker processes.
  • queue - a gen_server. How it operates depends on the type of spider; if it is a recursive spider (one that downloads all the pages of a website), this process holds:
    • a queue of URLs to download (a regular `queue:queue()`),
    • an `ets` table holding the set of URLs ever added to the queue (to avoid downloading the same link more than once: the queue process only enqueues a URL if it is NOT already in this ETS table),
    • a dictionary / map of tasks currently in progress (taken by a worker but not yet finished), as a map `#{<worker pid monitoring reference> => task()}` - if a worker crashes, its task can be re-scheduled,
    • a list of worker pids subscribed to this queue (possibly monitored).
    • It may also contain a set of rules to exclude some pages (e.g. based on robots.txt).
    • You should also have a URL normalisation function (e.g. to treat absolute and relative URLs as the same URL; to decide whether `?filter=wasd&page=2` is the same as `?page=2&filter=wasd`; to strip URL fragments such as `#footer`; etc.).

      The queue has quite a simple API: `push` / `push_many`, `subscribe` and `ack`. Worker gen_servers call `subscribe` and wait for a task message (it contains a URL and a unique reference). When a task is done, they call `ack(Reference)` and are ready to take the next task.
  • scheduler: it's basically the entry point and the only "brain" of your spider. It takes in tasks from wherever you want (pub/sub queues, cron, an HTTP API), puts the "seed" URLs on the queue, and spawns the workers (usually at start time, by calling the "workers supervisor" API; I only ever used a fixed number of workers, to avoid overloading the website or the crawler). It can also monitor queue size progress, and workers may report to the scheduler when a task is taken/done; it highly depends on your needs.
  • and of course workers: gen_servers, supervised by the "workers supervisor"; their start is initiated by the scheduler (or might just be fixed at app start time). At start they call `queue:subscribe` and wait for messages from the queue. When a message is received, a worker downloads the page, parses it, pushes all found URLs to the queue (the queue decides which URLs to accept and which to ignore), saves the results to the database, calls `queue:ack` at the end, and waits for the next task.
    There is a choice: let the worker crash on errors, or wrap the work in a top-level try/catch. I prefer to catch, so as not to spam Erlang's crash logs, but it depends on your requirements and expected error rates.
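The URL normalisation step mentioned above could be sketched with OTP's `uri_string` module (`uri_string:resolve/2` needs OTP 22.3 or later). The specific policy choices here, sorting query parameters and dropping fragments, are examples of decisions you have to make, not rules from this post:

```erlang
%% Sketch of a URL normalisation function: resolve a (possibly relative)
%% URL against the page it was found on, strip the #fragment, and sort
%% query parameters so ?a=1&b=2 and ?b=2&a=1 compare equal.
-module(url_norm).
-export([normalize/2]).

normalize(BaseUrl, Url) ->
    Abs  = uri_string:resolve(Url, BaseUrl),   % relative -> absolute
    Map  = uri_string:parse(Abs),
    Map1 = maps:remove(fragment, Map),         % strip #footer etc.
    Map2 = case Map1 of
               #{query := Q} ->
                   Pairs = lists:sort(uri_string:dissect_query(Q)),
                   Map1#{query => uri_string:compose_query(Pairs)};
               _ ->
                   Map1
           end,
    uri_string:recompose(Map2).
```

With this, `normalize("http://a.com/", "/x?page=2&filter=w#top")` and `normalize("http://a.com/", "/x?filter=w&page=2")` produce the same string, so the duplicate filter treats them as one URL.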
This structure proved to be very flexible and supports not only recursive crawlers but other kinds as well, e.g. non-recursive crawlers that take their entire URL set from an external source, download exactly what they were asked for, and save it to the DB (in that case the scheduler fetches URLs from the task source and puts them on the queue, and the queue has no duplicate filter).
The queue can also have namespaces, in case you want to crawl some website more than once and sometimes in parallel: use a task_id as a namespace for each crawl, so the duplicate filter discards URLs based on the {task_id, URL} pair.
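The per-task duplicate filter described above can be as small as one `ets` call per URL: with a `{task_id, URL}` key, `ets:insert_new/2` does the check-and-mark step atomically, returning `false` when the key is already present. Table and task names below are illustrative.

```erlang
%% Duplicate filter keyed on {task_id, URL}: enqueue only when
%% insert_new/2 returns true.
Seen = ets:new(seen_urls, [set, public]),
true  = ets:insert_new(Seen, {{task1, "http://a.com/page"}}),  % first visit: accept
false = ets:insert_new(Seen, {{task1, "http://a.com/page"}}),  % same task, same URL: drop
true  = ets:insert_new(Seen, {{task2, "http://a.com/page"}}).  % another crawl, new namespace: accept
```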

Hope this will help a bit.

Re: Best practice for Erlang's process design to build a website-downloader (super simple crawler)

I Gusti Ngurah Oka Prinarjaya
Hi,

Wowww.. thank you very much for sharing your experience and strategy with me. I really appreciate it.

OK, I'll start writing my own website crawler now.


Thank you



