"actor database" - architectural strategy question


"actor database" - architectural strategy question

Miles Fidelman-2
[Enough with the threads on Erlang angst for a while - time for some
real questions :-) ]

BACKGROUND:
A lot of what I do is systems engineering, and a lot of that ends up in
the realm of technology assessment - picking the right platform and
tools for a particular system.  My dabblings in Erlang are largely in
that category - I keep seeing it as potentially useful for a class of
systems, keep experimenting with it, have done a couple of
proof-of-concept efforts, but haven't built an operational system at
scale with it
(yet).  The focus, so far, has been in modeling and simulation (I first
discovered Erlang when chasing R&D contracts for a firm that built
simulation engines for military trainers.  I was flabbergasted to
discover that everything was written in C++, every simulated entity was
an object, with 4 main loops threading through every object, 20 times a
second.  Talk about spaghetti code.  Coming from a data comm.
protocol/network background - where we'd spawn a process for everything
- I asked the obvious question, and was told that context switches would
bring a 10,000 entity simulation to its knees.  My instinctual response
was "bullshit" - and went digging into the technology for massive
concurrency, and discovered Erlang.)

Anyway....  For years, I've been finding myself in situations, and on
projects, that have a common characteristic of linked documents that
change a lot - in the general arena of planning and workflow. Lots of
people, each editing different parts of different documents - with
changes rippling through the collection.  Think linked spreadsheets,
tiered project plans, multi-level engineering documents with lots of
inter-dependencies.  To be more concrete: systems engineering documents,
large proposals, business planning systems, command and control systems.

Add in requirements for disconnected operation that lead to
distribution/replication requirements rather than keeping single,
central copies of things (as the librarians like to say, "Lots of Copies
Keeps Stuff Safe").

So far we've always taken conventional approaches - ranging from manual
paper shuffling and xeroxing, to file servers with manual organization,
to some of MS Office's document linking capabilities, to document
databases and SharePoint.  And played with some XML database technologies.

But.... I keep thinking that there are a set of underlying functions
that beg for better tools - something like a distributed CVS that's
optimized for planning documents rather than software (or perhaps
something like a modernized Lotus Notes).

And I keep thinking that the obvious architectural model is to treat
each document (maybe each page) as an actor ("smart documents" if you
will), with communication through publish-subscribe mechanisms. Interact
with a (copy of) a document, changes get pushed to groups of documents
via a pub-sub mechanism.  (Not unlike actor based simulation approaches.)

And, of course, when I think actors, I think Erlang.  The obvious
conceptualization is "every document is an actor."
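As a rough illustration of that conceptualization (not any existing
library - the module name, message shapes, and functions here are all
made up), a per-document process that pushes changes to subscribed
documents might look like:

```erlang
%% Minimal sketch of "every document is an actor" with push-style
%% change propagation.  Plain processes, no OTP behaviours, for brevity.
-module(doc_actor).
-export([start/1, edit/2, subscribe/2, read/1]).

start(Content) ->
    spawn(fun() -> loop(Content, []) end).

edit(Doc, Change) -> Doc ! {edit, Change}, ok.

subscribe(Doc, Pid) -> Doc ! {subscribe, Pid}, ok.

read(Doc) ->
    Doc ! {read, self()},
    receive {doc_state, State} -> State end.

loop(Content, Subs) ->
    receive
        {edit, Change} ->
            %% apply the edit, then push it to every subscriber
            [S ! {changed, self(), Change} || S <- Subs],
            loop([Change | Content], Subs);
        {subscribe, Pid} ->
            loop(Content, [Pid | Subs]);
        {read, From} ->
            From ! {doc_state, Content},
            loop(Content, Subs);
        {changed, _FromDoc, _Change} ->
            %% a linked document changed; a real system would
            %% recompute dependent content here
            loop(Content, Subs)
    end.
```

A real version would be a gen_server under a supervisor, but the shape
of the interaction is the same.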

At which point an obvious question comes up:  How to handle long-term
persistence, for large numbers of inactive entities.

But... when I go looking for examples of systems that might be built
this way, I keep finding that, even in Erlang-based systems, persistence
is handled in fairly conventional ways:
- One might think that CouchDB treats every document as an actor, but
think again
- Paolo Negri has given some great presentations on how Wooga implements
large-scale social gaming - and they implement an actor per session -
but when a user goes off-line they push state into a more conventional
database  (then initialize a gen_server from the database, when the user
comes back online)

At which point the phrase "actor-oriented database" keeps coming back to
mind, with the obvious analogy to "object-oriented databases."  I.e.,
something with the persistence and other characteristics of a database,
where the contents are actors - with all the characteristics and
functionality of those actors preserved while stored in the database.

ON TO THE QUESTIONS:
I have a pretty good understanding of how one would build things like
simulations, or protocol servers, with Erlang - not so much how one
might build something with long-term persistence - which leads to some
questions (some, probably naive):

1. So far, I haven't seen anything that actually looks like an
"actor-oriented database."  Document databases implemented in Erlang,
yes (e.g., CouchDB), but every example I find ultimately pushes
persistent data into files or a more conventional database of some
sort.  Can anybody point to an example of something that looks more like
"storing actors in a database?"
- It strikes me that the core issues with doing so have to do with
maintaining "aliveness" - i.e., dealing with addressability, routing
messages to a stored actor, waking up after a timeout (i.e., the
equivalent of triggers)

2. One obvious (if simplistic) thought: Does one really need to think in
terms of a "database" at all - or might this problem be approached
simply by creating each document as an Erlang process, and keeping it
around forever?  Most of what I've seen built in Erlang focuses on
relatively short-lived actors - I'd be really interested in comments on:
- limitations/issues in persisting 100s of 1000s, or maybe millions of
actors, for extended periods of time (years, or decades)
- are there any tools/models for migrating (swapping?) inactive
processes dynamically to/from disk storage

3. What about backup for the state of a process?  'Let it crash' is
great for servers supporting a reliable protocol, not so great for an
actor that has internal state that has to be preserved (like a
simulated tank, or a "smart document"). Pushing into a database is
obvious, but...
- are there any good models for saving/restoring state within a tree of
supervised processes?
- what about models for synchronizing state across replicated copies of
processes running on different nodes?
- what about backup/restore of entire Erlang VMs (including anything
that might be swapped out onto disk)

4. For communications between/among actors:  Erlang is obviously
excellent for writing pub-sub engines (RabbitMQ and ejabberd come to
mind), but what about pub-sub or multicast/broadcast models or messaging
between Erlang processes?  Are there any good libraries for
defining/managing process groups, and doing multicast or broadcast
messaging to/among a group of processes?
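For what it's worth, OTP itself ships a basic process-group facility -
pg2 in the OTP of this era (replaced by pg in OTP 23+) - where
"broadcast" is just sending to every member.  A sketch, with made-up
module and group names:

```erlang
%% Group membership and broadcast via OTP's built-in pg2.
%% pg2:create/1 is idempotent; get_members/1 returns current members.
-module(doc_groups).
-export([join/1, broadcast/2]).

join(Group) ->
    pg2:create(Group),                       %% ensure the group exists
    pg2:join(Group, self()).

broadcast(Group, Msg) ->
    [Pid ! Msg || Pid <- pg2:get_members(Group)],
    ok.
```

pg2 works across distributed nodes as well, though richer routing
(topic-based pub-sub, etc.) is usually layered on gproc or an external
broker.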

Thank you very much for any pointers or thoughts.

Miles Fidelman




--
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra



"actor database" - architectural strategy question

Jesper Louis Andersen
A document is a trace of events. These events record edits to the
document, and when we replay all of the events, we obtain the final
document state. Infinite undo is possible by looking back and replaying
a prefix of the log - a point-in-time recovery option. An actor is a
handler that can apply events to a state in order to obtain a new state.

Events are persisted in an event log, write-ahead-log (WAL) style. So
even if the system dies, we can replay its state safely. Once in a
while, living processes checkpoint their state to disk so they can boot
up faster than having to replay from day 0.
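Concretely, this replay model is just a fold over the event log,
starting from the latest checkpoint.  A minimal sketch (the event
shapes are invented for illustration):

```erlang
%% Rebuild document state by folding the event log over a checkpoint.
%% State here is simply a list of lines.
-module(doc_events).
-export([apply_event/2, replay/2]).

%% apply one edit event to the current state
apply_event({append_line, Line}, Lines) ->
    Lines ++ [Line];
apply_event({delete_line, N}, Lines) ->
    {Before, [_ | After]} = lists:split(N - 1, Lines),
    Before ++ After.

%% replay/2 starts from a checkpointed state and applies later events;
%% point-in-time recovery is just replaying a prefix of the log
replay(Checkpoint, Events) ->
    lists:foldl(fun apply_event/2, Checkpoint, Events).
```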

Multiple edits to the same document can be handled by operational
transformation:

http://en.wikipedia.org/wiki/Operational_transformation

Idle documents terminate themselves after a while by checkpointing
themselves to disk. Documents register themselves in gproc, and if a
document is not present in gproc, you go to a manager and get it set
up, either from disk or by forming a new document.
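A sketch of that lookup-or-revive flow - using the built-in process
registry instead of gproc only to keep the example self-contained, and
with load_state/1 and checkpoint/2 as assumed stubs for the disk side:

```erlang
%% Wake a document on demand: look the name up in a registry, and if
%% the process is gone (it checkpointed and exited when idle), restart
%% it from disk.  gproc would also close the lookup/register race that
%% this simplified version has.
-module(doc_manager).
-export([ensure/1]).

-define(IDLE_MS, 30 * 60 * 1000).   %% idle docs checkpoint and exit

ensure(Name) ->
    case whereis(Name) of
        undefined ->
            State = load_state(Name),        %% checkpoint or fresh doc
            Pid = spawn(fun() -> loop(Name, State) end),
            register(Name, Pid),
            Pid;
        Pid ->
            Pid
    end.

loop(Name, State) ->
    receive
        {edit, Change} -> loop(Name, [Change | State])
    after ?IDLE_MS ->
        checkpoint(Name, State),             %% persist, then terminate
        ok
    end.

load_state(_Name) -> [].                     %% stub: read checkpoint
checkpoint(_Name, _State) -> ok.             %% stub: write checkpoint
```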

For easy storage, you can use a single table in a database for the log.





--
J.


"actor database" - architectural strategy question

Sergej Jurecko
In reply to this post by Miles Fidelman-2
http://www.actordb.com/


Sergej




"actor database" - architectural strategy question

Miles Fidelman-2
Well, thanks - there are some interesting ideas there, particularly
re. addressing, but...

"A distributed SQL database with the scalability of a KV store" - and
it uses SQLite as the back end.

Not quite what I'm looking for.  Not really a "database of actors" in
the way that, say, GemStone is an "object-oriented database."



--
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra



"actor database" - architectural strategy question

Sergej Jurecko
Each actor is isolated and independent, which is why using SQLite is completely scalable. I thought it would be quite clear that it fits your ideas well, so let me expand on your questions.


>>>
>>>
>>> 1. So far, I haven't seen anything that actually looks like an "actor-oriented database."  Document databases implemented in Erlang, yes (e.g., CouchDB), but every example I find ultimately pushes persistent data into files or a more conventional database of some sort.  Can anybody point to an example of something that looks more like "storing actors in a database?"
>>> - It strikes me that the core issues with doing so have to do with maintaining "aliveness" - i.e., dealing with addressability, routing messages to a stored actor, waking up after a timeout (i.e., the equivalent of triggers)
>>>

Storing the actor state in an SQL database gives you the most flexible data model. All the issues you listed are things ActorDB deals with.

>>> 2. One obvious (if simplistic) thought: Does one really need to think in terms of a "database" at all - or might this problem be approached simply by creating each document as an Erlang process, and keeping it around forever?  Most of what I've seen built in Erlang focuses on relatively short-lived actors - I'd be really interested in comments on:
>>> - limitations/issues in persisting 100s of 1000s, or maybe millions of actors, for extended periods of time (years, or decades)
>>> - are there any tools/models for migrating (swapping?) inactive processes dynamically to/from disk storage
>>>

ActorDB will keep an actor open while it is doing something. Once it stops doing stuff it is closed and remains untouched on disk until it is used again.
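The open-while-active behaviour described here maps naturally onto a gen_server that checkpoints and exits after an idle timeout. A minimal sketch - storage:load_state/1 and storage:save_state/2 are hypothetical persistence callbacks for illustration, not ActorDB API:

```erlang
%% Sketch only: an actor that persists itself and exits when idle.
%% storage:load_state/1 and storage:save_state/2 are hypothetical
%% callbacks, not part of ActorDB's actual API.
-module(idle_actor).
-behaviour(gen_server).
-export([start_link/1, init/1, handle_call/3, handle_cast/2,
         handle_info/2, terminate/2, code_change/3]).

-define(IDLE_TIMEOUT, 60000).  % close the actor after 60s of inactivity

start_link(Id) ->
    gen_server:start_link(?MODULE, Id, []).

init(Id) ->
    State = storage:load_state(Id),              % hypothetical
    {ok, {Id, State}, ?IDLE_TIMEOUT}.

handle_call({update, Fun}, _From, {Id, State}) ->
    NewState = Fun(State),
    {reply, ok, {Id, NewState}, ?IDLE_TIMEOUT}.

handle_cast(_Msg, S) ->
    {noreply, S, ?IDLE_TIMEOUT}.

%% No traffic within the timeout: checkpoint and stop.
handle_info(timeout, {Id, State}) ->
    ok = storage:save_state(Id, State),          % hypothetical
    {stop, normal, {Id, State}}.

terminate(_Reason, _S) -> ok.
code_change(_Old, S, _Extra) -> {ok, S}.
```

A simple_one_for_one supervisor with transient restart can then re-open such actors on demand.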

>>> 3. What about backup for the state of a process?  'Let it crash' is great for servers supporting a reliable protocol, not so great for an actor that has  internal state that has to be preserved (like a simulated tank, or a "smart document"). Pushing into a database is obvious, but...
>>> - are there any good models for saving/restoring state within a tree of supervised processes?
>>> - what about models for synchronizing state across replicated copies of processes running on different nodes?
>>> - what about backup/restore of entire Erlang VMs (including anything that might be swapped out onto disk)
>>>

ActorDB will replicate state to multiple servers.

>>> 4. For communications between/among actors:  Erlang is obviously excellent for writing pub-sub engines (RabbitMQ and ejabberd come to mind), but what about pub-sub or multicast/broadcast models or messaging between Erlang processes?  Are there any good libraries for defining/managing process groups, and doing multicast or broadcast messaging to/among a group of processes.
>>>

This is where separation of code and data comes in - a use case that ActorDB could easily be expanded to cover (without a lot of work). ActorDB handles the storage and persistence of state; the application programmer implements his program logic in his own code.

ActorDB could provide "triggers". For instance, you have an actor (some gen_server) running on a node:
- When it wants to store the result of some work, it stores it in its actor in ActorDB.
- If it wants to send a message to one or many actors, it would create a write transaction across multiple actors. The transaction would do an insert into their "events" table. ActorDB would then call back your own code for every one of those actors and tell them there is work for them to do. This way no event is ever lost.
- If you just want to send non-persistent messages to Erlang processes, you can use the pg or pg2 modules and sidestep ActorDB. Or maybe it makes sense to support sending non-persistent messages through ActorDB and rely on it to route those messages to the right processes on the right nodes.

Callbacks could be directly within Erlang or externally connected over HTTP or ZeroMQ, so your app logic can be implemented in anything.
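A rough sketch of the persistent-trigger idea. The actordb_client:exec/2 call and the globally registered {doc, Id} process name are illustrative assumptions, not ActorDB's actual API:

```erlang
%% Sketch: persist the event first (so it survives a crash), then
%% wake any live process for that actor. actordb_client:exec/2 and
%% the 'events' table schema are assumptions for illustration.
notify(ActorIds, Event) ->
    lists:foreach(
      fun(Id) ->
          %% 1. durable: insert into the actor's events table
          ok = actordb_client:exec(Id,
                 ["INSERT INTO events (payload) VALUES ('",
                  Event, "');"]),
          %% 2. best-effort: poke the actor's process, if running
          case global:whereis_name({doc, Id}) of
              undefined -> ok;   % it will see the event on next open
              Pid       -> Pid ! {event_pending, Id}
          end
      end, ActorIds).
```

Because the insert happens before the notification, a crashed or closed actor still finds the pending event in its table when it next opens.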


Sergej



"actor database" - architectural strategy question

Miles Fidelman-2
In reply to this post by Jesper Louis Andersen
Guys, please - I'm not looking at modeling "a document" - I'm looking at
modeling 100s of copies of documents, each with local changes, that are
currently being updated (say, my copy of a project plan); as well as
families of related documents (say a schedule, a budget, progress
reports). At any given time, I may make a marginal note or an edit -
and I want that change to propagate according to local rules.

The best conceptual model I come up with is: each document is an actor,
with a mailbox, and a multi-cast pub-sub mechanism for sending changes
to a group of people who have copies of the document.  (Think of a
loose-leaf binder, and change pages coming through the mail).  For that
matter, think USENET news and NNTP - except with each message being
addressable - instead of a thread being 100 messages, it becomes one
message, plus 99 updates to that message, each processed by code within
the 1st message.

If I wanted to model this as a standard database, or to serialize state
into a traditional database, I wouldn't be asking the questions I
asked.  Can anybody talk to the questions I actually asked, about:
- handling large numbers of actors that might persist for years, or
decades (where actor = Erlang-style process)
- backup up/restoring state of long-running actors that might crash
- multi-cast messaging among actors
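On the multicast point specifically, OTP's pg2 module (mentioned elsewhere in the thread) already provides named, distributed process groups. A minimal sketch:

```erlang
%% Minimal sketch of multicast among document actors using pg2
%% (OTP's distributed process-group module; superseded by pg in
%% later OTP releases). Each copy of a document joins a group,
%% and a change is fanned out to every member on any node.
-module(doc_group).
-export([join/1, publish/2]).

join(DocId) ->
    ok = pg2:create({doc, DocId}),      % idempotent group creation
    ok = pg2:join({doc, DocId}, self()).

publish(DocId, Change) ->
    [Pid ! {doc_change, DocId, Change}
     || Pid <- pg2:get_members({doc, DocId})],
    ok.
```

Delivery here is non-persistent best-effort, so it answers the routing half of the question; durability still needs one of the persistence schemes discussed above.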

Erlang is the closest I've found to an environment where this might be
practical - but that remains an open question.  So please - while
appreciated, I'm really not looking for information about alternative
conceptual models; I'm looking for hard information about how this
conceptual model might be implemented in Erlang.

Miles

Michael Radford wrote:

> I'd suggest taking a look at Riak, and also Basho's library riak-core.
> With Riak and a bit of Erlang, you can easily model a document as a
> sequence of change operations which are composed on-demand to present
> the latest version. On top of that, you get mechanisms for maintaining
> the database without any single point of failure, and for dealing with
> simultaneous/competing changes from multiple users.
>
> For persisting actors, the nice thing about Erlang is that pretty much
> whatever your actor's state is [*], you can store its term_to_binary
> representation in whatever database you choose. [* except for
> anonymous functions, which you can always turn into atoms as long as
> they're not completely arbitrary.]
>
> On Mon, Feb 17, 2014 at 6:50 AM, Jesper Louis Andersen
> <jesper.louis.andersen> wrote:
>> A document is a trace of events. These events records edits to the document
>> and when we play all of the events, we obtain the final document state.
>> Infinite undo is possible by looking back and replaying with a point-in-time
>> recovery option. An actor is a handler that can apply events to a state in
>> order to obtain a new state.
>>
>> Events are persisted in an event log and WAL fashion. So even if the system
>> dies, we can replay its state safely. Once in a while, living processes
>> checkpoint their state to disk so they can boot up faster than having to
>> replay from day 0.
>>
>> Multiple edits to the same document can be handled by operational transforms
>>
>> http://en.wikipedia.org/wiki/Operational_transformation
>>
>> Idle documents terminate themselves after a while by checkpointing
>> themselves to disk. Documents register themselves into gproc and if there is
>> no document present in gproc, you go to a manager and get it set up either
>> from disk or by forming a new document.
>>
>> For easy storage, you can use a single table in a database for the log.
>>
>>
>> On Mon, Feb 17, 2014 at 3:20 PM, Miles Fidelman <mfidelman>
>> wrote:
>>> [snip - original message quoted in full]


--
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra



"actor database" - architectural strategy question

Michał Ptaszek
In reply to this post by Miles Fidelman-2
This sounds interesting. To start with, I think swapping processes to
disk is just an optimization.
In theory you could just keep everything in RAM forever. I guess processes
could keep their state in dictionaries (so you could roll them back) or ets
tables (if you didn't want to roll them back).

You would need some form of crash recovery so processes should write some
state information
to disk at suitable points in the program.
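Writing state information to disk at suitable points can be as simple as term_to_binary plus an atomic rename; the checkpoints/ directory layout and integer ids here are just illustrative assumptions:

```erlang
%% Sketch: checkpoint a process state with term_to_binary, writing
%% to a temp file and renaming so a crash mid-write never leaves a
%% corrupt checkpoint. Directory layout is an assumption.
checkpoint(Id, State) ->
    Path = filename:join("checkpoints", integer_to_list(Id)),
    Tmp  = Path ++ ".tmp",
    ok = filelib:ensure_dir(Tmp),
    ok = file:write_file(Tmp, term_to_binary(State)),
    ok = file:rename(Tmp, Path).        % atomic on POSIX filesystems

recover(Id) ->
    Path = filename:join("checkpoints", integer_to_list(Id)),
    case file:read_file(Path) of
        {ok, Bin}       -> {ok, binary_to_term(Bin)};
        {error, enoent} -> {error, no_checkpoint}
    end.
```

The caveat Michael Radford raises elsewhere in the thread applies: anonymous funs in the state do not round-trip safely across code changes.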

What I think is a more serious problem is getting data into the system in
the first place.
I have done some experiments with document commenting and annotation
systems and
found it very difficult to convert things like word documents into a form
that looks half
decent in a user interface.

I want to parse Microsoft word files and PDF etc. and display them in a
format that is
recognisable and not too abhorrent to the user. I also want to allow
on-screen manipulation of
documents (in a browser) - all of this seems to require a mess of
Javascript (in the browser) and a mess of parsing programs in the server.

Before we can manipulate documents we must parse them and turn them into a
format
that can be manipulated. I think this is more difficult than the storing
and manipulating documents
problem. You'd also need support for full-text indexing, foreign language
and multiple character sets and so
on. Just a load of horrible messy small problems, but a significant barrier
to importing large amounts
of content into the system.

You'd also need some quality control of the documents as they enter the
system (to avoid rubbish in rubbish out), also to maintain the integrity of
the documents.

If you have any ideas of how to get large volumes of data into the system
from proprietary formats
(like ms word) I'd like to hear about it.

Cheers

/Joe





On Mon, Feb 17, 2014 at 3:20 PM, Miles Fidelman
<mfidelman>wrote:

> [snip - original message quoted in full]


"actor database" - architectural strategy question

Vance Shipley-2
The zip and xmerl applications are all that is needed to turn OpenOffice
native ODF into Erlang terms. All you need to do next is everything you
need to do.  ;)
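The zip + xmerl route is indeed only a few lines: an .odt file is a zip archive whose content.xml carries the document body. A sketch, with error handling omitted:

```erlang
%% Sketch: extract content.xml from an OpenDocument file in memory
%% and parse it into an xmerl element tree. Error handling omitted.
odt_to_terms(OdtPath) ->
    {ok, [{_Name, ContentXml}]} =
        zip:extract(OdtPath, [{file_list, ["content.xml"]}, memory]),
    {Doc, _Rest} = xmerl_scan:string(binary_to_list(ContentXml)),
    Doc.   % an #xmlElement{} record tree of the ODF content
```

Turning that element tree into something "not too abhorrent to the user" is, as Joe says, where the real work starts.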
 On Feb 18, 2014 12:45 AM, "Joe Armstrong" <erlang> wrote:

> This sounds interesting. To start wit,  I think swapping processes to disk
> is just an optimization.
> In theory you could just keep everything in RAM forever. I guess processes
> could keep their state in dictionaries (so you could roll them back) or ets
> tables (if you didn't want to roll them back).
>
> You would need some form of crash recovery so processes should write some
> state information
> to disk at suitable points in the program.
>
> What I think is a more serious problem is getting data into the system in
> the first place.
> I have done some experiments with document commenting and annotation
> systems and
> found it very difficult to convert things like word documents into a form
> that looks half
> decent in a user interface.
>
> I want to parse Microsoft word files and PDF etc. and display them in a
> format that is
> recognisable and not too abhorrent to the user. I also want to allow
> on-screen manipulation of
> documents (in a browser) - all of this seems to require a mess of
> Javascript (in the browser)and a mess of parsing programs inn the server.
>
> Before we can manipulate documents we must parse them and turn them into a
> format
> that can be manipulated. I think this is more difficult that the storing
> and manipulating documents
> problem. You'd also need support for full-text indexing, foreign language
> and multiple character sets and so
> on. Just a load of horrible messy small problems, but a significant
> barrier to importing large amounts
> of content into the system.
>
> You'd also need some quality control of the documents as they enter the
> system (to avoid rubbish in rubbish out), also to maintain the integrity of
> the documents.
>
> If you have any ideas of now to get large volumes of data into the system
> from proprietary formats
> (like ms word) I'd like to hear about it.
>
> Cheers
>
> /Joe
>
>
>
>
>
> On Mon, Feb 17, 2014 at 3:20 PM, Miles Fidelman <
> mfidelman> wrote:
>
>> [Enough with the threads on Erlang angst for a while - time for some real
>> questions :-) ]
>>
>> BACKGROUND:
>> A lot of what I do is systems engineering, and a lot of that ends up in
>> the realm of technology assessment - picking the right platform and tools
>> for a particular system.  My dablings in Erlang are largely in that
>> category - I keep seeing it as potentially useful for a class of systems,
>> keep experimenting with it, done a couple proof-of-concept efforts, but
>> haven't built an operational system at scale with it (yet).  The focus, so
>> far, has been in modeling and simulation (I first discovered Erlang when
>> chasing R&D contracts for a firm that built simulation engines for military
>> trainers.  I was flabbergasted to discover that everything was written in
>> C++, every simulated entity was an object, with 4 main loops threading
>> through every object, 20 times a second.  Talk about spaghetti code.
>>  Coming from a data comm. protocol/network background - where we'd spawn a
>> process for everything - I asked the obvious question, and was told that
>> context switches would bring a 10,000 entity simulation to its knees.  My
>> instinctual response was "bullshit" - and went digging into the technology
>> for massive concurrency, and discovered Erlang.)
>>
>> Anyway....  For years, I've been finding myself in situations, and on
>> projects, that have a common characteristic of linked documents that change
>> a lot - in the general arena of planning and workflow. Lots of people, each
>> editing different parts of different documents - with changes rippling
>> through the collection.  Think linked spreadsheets, tiered project plans,
>> multi-level engineering documents with lots of inter-dependencies.  To be
>> more concrete: systems engineering documents, large proposals, business
>> planning systems, command and control systems.
>>
>> Add in requirements for disconnected operation that lead to
>> distribution/replication requirements rather than keeping single, central
>> copies of things (as the librarians like to say, "Lots of Copies Keeps
>> Stuff Safe").
>>
>> So far we've always taken conventional approaches - ranging from manual
>> paper shuffling and xeroxing, to file servers with manual organization, to
>> some of MS Office's document linking capabilities, to document databases
>> and sharepoint.  And played with some XML database technologies.
>>
>> But.... I keep thinking that there are a set of underlying functions that
>> beg for better tools - something like a distributed CVS that's optimized
>> for planning documents rather than software (or perhaps something like a
>> modernized Lotus Notes).
>>
>> And I keep thinking that the obvious architectural model is to treat each
>> document (maybe each page) as an actor ("smart documents" if you will),
>> with communication through publish-subscribe mechanisms. Interact with a
>> (copy of) a document, changes get pushed to groups of documents via a
>> pub-sub mechanism.  (Not unlike actor based simulation approaches.)
>>
>> And, of course, when I think actors, I think Erlang.  The obvious
>> conceptualization is "every document is an actor."
>>
>> At which point an obvious question comes up:  How to handle long-term
>> persistence, for large numbers of inactive entities.
>>
>> But... when I go looking for examples of systems that might be built this
>> way, I keep finding that, even in Erlang-based systems, persistence is
>> handled in fairly conventional ways:
>> - One might think that CouchDB treats every document as an actor, but
>> think again
>> - Paulo Negri has given some great presentations on how Wooga implements
>> large-scale social gaming - and they implement an actor per session - but
>> when a user goes off-line they push state into a more conventional database
>>  (then initialize a gen_server from the database, when the user comes back
>> online)
>>
>> At which point the phrase "actor-oriented database" keeps coming back to
>> mind, with the obvious analogy to "object-oriented databases."  I.e.,
>> something with the persistence and other characteristics of a database,
>> where the contents are actors - with all the characteristics and
>> functionality of those actors preserved while stored in the database.
>>
>> ON TO THE QUESTIONS:
>> I have a pretty good understanding of how one would build things like
>> simulations, or protocol servers, with Erlang - not so much how one might
>> build something with long-term persistence - which leads to some questions
>> (some, probably naive):
>>
>> 1. So far, I haven't seen anything that actually looks like an
>> "actor-oriented database."  Document databases implemented in Erlang, yes
>> (e.g., CouchDB), but every example I find ultimately pushes persistent data
>> into files or a more conventional database of some sort.  Can anybody point
>> to an example of something that looks more like "storing actors in a
>> database?"
>> - It strikes me that the core issues with doing so have to do with
>> maintaining "aliveness" - i.e., dealing with addressability, routing
>> messages to a stored actor, waking up after a timeout (i.e., the equivalent
>> of triggers)
>>
>> 2. One obvious (if simplistic) thought: Does one really need to think in
>> terms of a "database" at all - or might this problem be approached simply
>> by creating each document as an Erlang process, and keeping it around
>> forever?  Most of what I've seen built in Erlang focuses on relatively
>> short-lived actors - I'd be really interested in comments on:
>> - limitations/issues in persisting 100s of 1000s, or maybe millions of
>> actors, for extended periods of time (years, or decades)
>> - are there any tools/models for migrating (swapping?) inactive processes
>> dynamically to/from disk storage
>>
>> 3. What about backup for the state of a process?  'Let it crash' is great
>> for servers supporting a reliable protocol, not so great for an actor that
>> has  internal state that has to be preserved (like a simulated tank, or a
>> "smart document"). Pushing into a database is obvious, but...
>> - are there any good models for saving/restoring state within a tree of
>> supervised processes?
>> - what about models for synchronizing state across replicated copies of
>> processes running on different nodes?
>> - what about backup/restore of entire Erlang VMs (including anything that
>> might be swapped out onto disk)
>>
>> 4. For communications between/among actors:  Erlang is obviously
>> excellent for writing pub-sub engines (RabbitMQ and ejabberd come to mind),
>> but what about pub-sub or multicast/broadcast models or messaging between
>> Erlang processes?  Are there any good libraries for defining/managing
>> process groups, and doing multicast or broadcast messaging to/among a group
>> of processes?
>>
>> Thank you very much for any pointers or thoughts.
>>
>> Miles Fidelman
>>
>>
>>
>>
>> --
>> In theory, there is no difference between theory and practice.
>> In practice, there is.   .... Yogi Berra
>>
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions
>> http://erlang.org/mailman/listinfo/erlang-questions
>>
>


"actor database" - architectural strategy question

Miles Fidelman-2
In reply to this post by Michał Ptaszek
Joe Armstrong wrote:
> This sounds interesting. To start with, I think swapping processes to
> disk is just an optimization.
> In theory you could just keep everything in RAM forever. I guess
> processes could keep their state in dictionaries (so you could roll
> them back) or ets tables (if you didn't want to roll them back).
>
> You would need some form of crash recovery so processes should write
> some state information
> to disk at suitable points in the program.
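A minimal sketch of what Joe describes - writing state to disk at suitable points and recovering it after a crash - using a dets table. The module and table names here are illustrative, not an existing library:

```erlang
-module(checkpoint).
-export([init/1, save/2]).

%% Recover a process's last checkpointed state, or start fresh.
init(Id) ->
    {ok, _} = dets:open_file(checkpoints, [{file, "checkpoints.dets"}]),
    case dets:lookup(checkpoints, Id) of
        [{Id, State}] -> State;   % crash recovery: resume where we left off
        []            -> #{}      % no checkpoint yet: empty state
    end.

%% Call at "suitable points", i.e. after each significant state change.
save(Id, State) ->
    ok = dets:insert(checkpoints, {Id, State}),
    ok = dets:sync(checkpoints).  % force the change out to disk
```

A supervised process would call checkpoint:init/1 from its own init, so a restart after 'let it crash' picks up from the last saved state.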

Joe...  can you offer any insight into the dynamics of Erlang, when
running with large numbers of processes that have very long persistence?
Somehow, it strikes me that 100,000 processes with 1MB of state, each
running for years at a time, have a different dynamic than 100,000
processes, each representing a short-lived protocol transaction (say a
web query).

Coupled with a communications paradigm for identifying a group of
processes and sending each of them the same message (e.g., 5000 people
have a copy of a book, send all 5000 of them a set of errata; or send a
message asking 'who has updates for section 3.2?').

In some sense, the conceptual model is:
1. I send you an empty notebook.
2. The notebook has an address and a bunch of message handling routines
3. I can send a page to the notebook, and the notebook inserts the page.
4. You can interact with the notebook - read it, annotate it, edit
certain sections - if you make updates, the notebook can distribute
updates to other copies - either through a P2P mechanism or a
publish-subscribe mechanism.

At a basic level, this maps really well onto the Actor formalism - every
notebook is an actor, with its own address.  Updates, interactions,
queries, etc. are simply messages.
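The notebook model above can be sketched directly as one gen_server per notebook copy, with an OTP pg process group per title so that a single broadcast reaches every copy. All names are illustrative, and the sketch assumes OTP 23+ with the default pg scope started (pg:start_link/0):

```erlang
-module(notebook).
-behaviour(gen_server).
-export([start_link/1, insert_page/3, pages/1, broadcast/2]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link(Id) -> gen_server:start_link(?MODULE, Id, []).

%% "I can send a page to the notebook, and the notebook inserts the page."
insert_page(Notebook, N, Text) ->
    gen_server:cast(Notebook, {insert_page, N, Text}).

pages(Notebook) -> gen_server:call(Notebook, pages).

%% Send the same message (e.g. a set of errata) to every live copy of
%% the book identified by Id.
broadcast(Id, Msg) ->
    [gen_server:cast(Pid, Msg) || Pid <- pg:get_members({notebook, Id})],
    ok.

init(Id) ->
    ok = pg:join({notebook, Id}, self()),   % join this title's group
    {ok, #{id => Id, pages => #{}}}.

handle_cast({insert_page, N, Text}, State = #{pages := P}) ->
    {noreply, State#{pages := P#{N => Text}}}.

handle_call(pages, _From, State = #{pages := P}) ->
    {reply, P, State}.
```

Updates, annotations, and sync messages would all be further handle_cast/handle_call clauses on the same process.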

Since Erlang is about the only serious implementation of the Actor
formalism, I'm trying to poke at the edge cases - particularly around
long-lived actors.  And who better to ask than you :-)

In passing: Early versions of Smalltalk were actor-like, encapsulating
state, methods, and process - but process kind of got dropped along the
way.  By contrast, it strikes me that Erlang focuses on everything being
a process, and long-term persistence of state has taken a back seat.  
I'm trying to probe the edge cases. (I guess another way of looking at
this is: to what extent is Erlang workable for writing systems based
around the mobile agent paradigm?)



>
> What I think is a more serious problem is getting data into the system
> in the first place.
> I have done some experiments with document commenting and annotation
> systems and
> found it very difficult to convert things like word documents into a
> form that looks half
> decent in a user interface.

Haven't actually thought a lot about that part of the problem. I'm
thinking of documents that are more form-like in nature, or at least
built up from smaller components - so it's not so much going from Word
to an internal format, as much as starting with XML or JSON (or tuples),
building up structure, and then adding presentation at the final step.  
XML -> Word is a lot easier than the reverse :-)

On the other hand, I do have a bunch of applications in mind where
parsing Word and/or PDF would be very helpful - notably stripping
requirements out of specifications.  (I can't tell you how much of my
time I spend manually cutting and pasting from specifications into
spreadsheets - for requirements tracking and such.)  Again, presentation
isn't that much of an issue - structural and semantic analysis is.  But,
while important, that's a separate set of problems - and there are some
commercial products that do a reasonably good job.

> I want to parse Microsoft word files and PDF etc. and display them in
> a format that is
> recognisable and not too abhorrent to the user. I also want to allow
> on-screen manipulation of
> documents (in a browser) - all of this seems to require a mess of
> Javascript (in the browser) and a mess of parsing programs in the server.
>
> Before we can manipulate documents we must parse them and turn them
> into a format
> that can be manipulated. I think this is more difficult than the
> storing and manipulating documents
> problem. You'd also need support for full-text indexing, foreign
> language and multiple character sets and so
> on. Just a load of horrible messy small problems, but a significant
> barrier to importing large amounts
> of content into the system.
>
> You'd also need some quality control of the documents as they enter
> the system (to avoid rubbish in rubbish out), also to maintain the
> integrity of the documents.

Again, for this problem space, it's more about building up complex
documents from small pieces, than carving up pre-existing documents.  
More like the combination of an IDE and a distributed CVS - where fully
"compiled" documents are the final output.

>
> If you have any ideas of how to get large volumes of data into the
> system from proprietary formats
> (like ms word) I'd like to hear about it.
>

Me too :-)  Though, I go looking for such things every once in a while, and:
- there are quite a few PDF to XML parsers, but mostly commercial ones
- there are a few PDF and Word "RFP stripping" products floating around,
that are smart enough to actually analyze the content of structured
documents (check out Meridian)
- later versions of Word export XML, albeit poor XML
- there are quite a few document analysis packages floating around,
including ones that start from OCR images - but they generally focus on
content (lexical analysis) and ignore structure (it's easier to scan a
document and extract some measure of what it's about - e.g. for indexing
purposes; it's a lot harder to find something that will extract the
outline structure of a document)


Cheers,

Miles


--
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra



"actor database" - architectural strategy question

Mahesh Paolini-Subramanya
"Large number of processes with very long persistence"

You *will* run into GC issues here, and of all kinds:
   - design artifacts ("hmm, the number of lists that I manipulate increases relentlessly...")
   - misunderstanding ("But I passed the binary on, without manipulating it at all!")
   - bugs (Fred has a great writeup on this somewhere)

Just keep in mind that in the end, you will almost certainly end up doing some form of manual GC activities. Again, the Heroku gang can probably provide a whole bunch of pointers on this...
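One concrete tactic for long-lived but mostly idle processes is hibernation, which forces a full-sweep GC and discards the call stack between bursts of activity. A sketch (it helps with retained garbage, not with the design bugs listed above):

```erlang
-module(long_lived).
-export([idle_loop/1]).

%% After a quiet minute, compact the heap and sleep; the next incoming
%% message wakes the process back up in idle_loop/1.
idle_loop(State) ->
    receive
        {update, Fun} -> idle_loop(Fun(State))
    after 60000 ->
        erlang:hibernate(?MODULE, idle_loop, [State])
    end.
```

A gen_server gets the same effect by returning {noreply, State, hibernate}; erlang:garbage_collect/0 is the blunter manual alternative.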

Cheers,
Mahesh Paolini-Subramanya
That tall bald Indian guy..
Google+ | Blog | Twitter | LinkedIn

On February 17, 2014 at 3:22:22 PM, Miles Fidelman (mfidelman) wrote:



"actor database" - architectural strategy question

Miles Fidelman-2
In reply to this post by Miles Fidelman-2
Ahh... thanks!  Now I know what to look for as I start perusing the
riak_core architectural documents.

Re. a couple of your other points:

Michael Radford wrote:
> I was addressing exactly your second point: backing up and restoring actors.
>
> Since you brought up that point, it didn't sound like an "actor" in
> your model is 1-1 with an Erlang process for all time. So OK, you can
> checkpoint them and restart them. But what if you want the "actor" to
> continue to "respond to messages" when any single machine crashes?
> Maybe multiple Erlang processes on multiple machines can coordinate to
> provide a highly available representation of your per-document actor.

In some sense, I keep coming back to the model of paper documents - but
just a little bit "smarter."
- I have a copy of the same document at the office, at home, in my
briefcase, someone else has a copy
- updates can be distributed via copy machine and mail
- with paper, if one of them gets lost or stolen, there are other
copies, but you lose marginal notes
- add a little "intelligence" and now copies can be synced (a la what we
do with our smart phones and laptops) - with the sync logic embedded in
the document rather than the environment (which makes them actor-like
rather than simply file- or object-like)
- which does address one approach to backup - things are mirrored; if a
node crashes, there are other copies around and it's easy to restore state
(the only time things get lost is if one is running disconnected and
things crash) -- though it becomes pretty inefficient if the power
fails, or you want to reboot the machine, or what have you
>
> For the same reason, it sounds like you need reliable messaging, so
> that a locally-triggered change isn't lost if a machine crashes. So
> you may need to persist the update messages in multiple places.

I'm thinking that this is embedded in publish subscribe channels, with
logs/archives.  I keep coming back to a USENET/NNTP model for both
messaging and replication.
>
> Finally, you also mentioned "copies of documents" with "local changes"
> triggering changes in other copies of the documents. What if the
> locally-triggered changes conflict?

For this, I'm thinking that each document has an embedded change control
system.  Akin to having one's own copy of a git repository. (The Fossil
distributed version control system is a really nice model for this.)
>
> For example, a budget with a total, where any user can add or remove
> line items. To compute the current total, you need rules for resolving
> conflicting additions and removals of the same line item. Since you
> mentioned that these changes are "local" and you're sending messages
> among the document actors, it sounded like you had something like this
> in mind.

Exactly.  Which is one of the things that leads me to the actor model -
embed the rules directly in each copy of the document, rather than in
external logic.
>
> If any of these things are a concern for your application, I thought
> you might find the ideas behind riak-core helpful at least.

Thanks!  I'm going off to look now!


--
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra



"actor database" - architectural strategy question

Fred Hebert
In reply to this post by Mahesh Paolini-Subramanya
On 02/17, Mahesh Paolini-Subramanya wrote:
>    - Bugs (Fred has a great writeup on this somewhere)
>
> Just keep in mind that in the end, you will almost certainly end up doing some form of manual GC activities. Again, the Heroku gang can probably provide a whole bunch of pointers on this...
>

That would probably be the blog post on my investigations and
optimizations in Logplex:

https://blog.heroku.com/archives/2013/11/7/logplex-down-the-rabbit-hole

Cheers,
Fred.


"actor database" - architectural strategy question

Michael Truog-2
In reply to this post by Mahesh Paolini-Subramanya
You then have the choice of trying to tweak the GC parameters to avoid consuming too much memory, but that only works if your throughput stays roughly within what you expect; otherwise you have to retune the GC settings for the new throughput maximum. Alternatively, you can use a short-lived process to manipulate memory, such that the result is returned upon the short-lived process's death. (A third option is forcing garbage collection manually, which is dirty, but possible with http://www.erlang.org/doc/man/erlang.html#garbage_collect-0.)

So, using short-lived processes to facilitate the work of longer-lived processes is generally the solution to this problem. Making an Erlang process is cheap, and using a short-lived process to help the GC know what memory is old is a simple way of handling it. If you use CloudI, the cloudi_service behaviour does this for you by default when you receive service requests, via the request_pid_uses service configuration option (http://cloudi.org/api.html#2_services_add_config_opts) - so it is part of CloudI's service abstraction.
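The short-lived-helper pattern Michael describes can be sketched in a few lines (a hypothetical helper, not CloudI's actual implementation): the allocation-heavy work runs in a throwaway process, its garbage dies with that process, and only the result is copied back to the long-lived caller:

```erlang
-module(gc_helper).
-export([in_worker/1]).

%% Run Fun in a temporary process; all intermediate garbage is freed
%% when that process exits, and only the result reaches the caller.
in_worker(Fun) ->
    Parent = self(),
    Ref = make_ref(),
    {Pid, MRef} = spawn_monitor(fun() -> Parent ! {Ref, Fun()} end),
    receive
        {Ref, Result} ->
            erlang:demonitor(MRef, [flush]),
            Result;
        {'DOWN', MRef, process, Pid, Reason} ->
            exit({worker_died, Reason})
    end.
```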


On 02/17/2014 12:42 PM, Mahesh Paolini-Subramanya wrote:



"actor database" - architectural strategy question

Michał Ptaszek
In reply to this post by Miles Fidelman-2
On Mon, Feb 17, 2014 at 9:22 PM, Miles Fidelman
<mfidelman> wrote:


T


>
> Coupled with a communications paradigm for identifying a group of
> processes and sending each of them the same message (e.g., 5000 people have
> a copy of a book, send all 5000 of them a set of errata; or send a message
> asking 'who has updates for section 3.2).
>
> In some sense, the conceptual model is:
> 1. I send you an empty notebook.
> 2. The notebook has an address and a bunch of message handling routines
> 3. I can send a page to the notebook, and the notebook inserts the page.
> 4. You can interact with the notebook - read it, annotate it, edit certain
> sections - if you make updates, the notebook can distribute updates to
> other copies - either through a P2P mechanism or a publish-subscribe
> mechanism.
>
> At a basic level, this maps really well onto the Actor formalism - every
> notebook is an actor, with it's own address.  Updates, interactions,
> queries, etc. are simply messages.
>
> Since Erlang is about the only serious implementation of the Actor
> formalism, I'm trying to poke at the edge cases - particularly around
> long-lived actors.  And who better to ask than you :-)
>
> In passing: Early versions of Smalltalk were actor-like, encapsulating
> state, methods, and process - but process kind of got dropped along the
> way.  By contrast, it strikes me that Erlang focuses on everything being a
> process, and long-term persistence of state has taken a back seat.  I'm
> trying to probe the edge cases. (I guess another way of looking at this is:
> to what extent is Erlang workable for writing systems based around the
> mobile agent paradigm?)
>
>
>
>
>
>> What I think is a more serious problem is getting data into the system in
>> the first place.
>> I have done some experiments with document commenting and annotation
>> systems and
>> found it very difficult to convert things like word documents into a form
>> that looks half
>> decent in a user interface.
>>
>
> Haven't actually thought a lot about that part of the problem. I'm
> thinking of documents that are more form-like in nature, or at least built
> up from smaller components - so it's not so much going from Word to an
> internal format, as much as starting with XML or JSON (or tuples), building
> up structure, and then adding presentation at the final step.  XML -> Word
> is a lot easier than the reverse :-)
>
> On the other hand, I do have a bunch of applications in mind where parsing
> Word and/or PDF would be very helpful - notably stripping requirements out
> of specifications.  (I can't tell you how much of my time I spend manually
> cutting and pasting from specifications into spreadsheets - for
> requirements tracking and such.)  Again, presentation isn't that much of an
> issue - structural and semantic analysis is.  But, while important, that's
> a separate set of problems - and there are some commercial products that do
> a reasonably good job.
>
>
>  I want to parse Microsoft word files and PDF etc. and display them in a
>> format that is
>> recognisable and not too abhorrent to the user. I also want to allow
>> on-screen manipulation of
>> documents (in a browser) - all of this seems to require a mess of
>> Javascript (in the browser) and a mess of parsing programs in the server.
>>
>> Before we can manipulate documents we must parse them and turn them into
>> a format
>> that can be manipulated. I think this is more difficult than the storing
>> and manipulating documents
>> problem. You'd also need support for full-text indexing, foreign language
>> and multiple character sets and so
>> on. Just a load of horrible messy small problems, but a significant
>> barrier to importing large amounts
>> of content into the system.
>>
>> You'd also need some quality control of the documents as they enter the
>> system (to avoid rubbish in rubbish out), also to maintain the integrity of
>> the documents.
>>
>
> Again, for this problem space, it's more about building up complex
> documents from small pieces, than carving up pre-existing documents.  More
> like the combination of an IDE and a distributed CVS - where fully
> "compiled" documents are the final output.
>
>
>
>> If you have any ideas of now to get large volumes of data into the system
>> from proprietary formats
>> (like ms word) I'd like to hear about it.
>>
>>
> Me too :-)  Though, I go looking for such things every once in a while,
> and:
> - there are quite a few PDF to XML parsers, but mostly commercial ones
> - there are a few PDF and Word "RFP stripping" products floating around,
> that are smart enough to actually analyze the content of structured
> documents (check out Meridian)
> - later versions of Word export XML, albeit poor XML
> - there are quite a few document analysis packages floating around,
> including ones that start from OCR images - but they generally focus on
> content (lexical analysis) and ignore structure (it's easier to scan a
> document and extract some measure of what it's about - e.g. for indexing
> purposes; it's a lot harder to find something that will extract the outline
> structure of a document)
>
>
> Cheers,
>
> Miles
>
>
>
> --
> In theory, there is no difference between theory and practice.
> In practice, there is.   .... Yogi Berra
>
> _______________________________________________
> erlang-questions mailing list
> erlang-questions
> http://erlang.org/mailman/listinfo/erlang-questions
>


"actor database" - architectural strategy question

Miles Fidelman-2
In reply to this post by Mahesh Paolini-Subramanya
Mahesh,

Mahesh Paolini-Subramanya wrote:
> "Large number of processes with very long persistence"
>
> You *will* run into GC issues here, and of all kinds
>    - design artifacts ("hmm, the number of lists that I manipulate
> increases relentlessly...")
>    - misunderstanding ("But I passed the binary on, without
> manipulating it at all!")
>    - Bugs (Fred has a great writeup on this somewhere)

Very good points - though to a degree they sound more like dependency
hell than traditional garbage collection to reclaim memory.

Given the document-oriented view, I'm viewing garbage collection more in
the sense of filing and archiving - the same way that paper documents
migrate to filerooms then to archives; or email and computer files
simply get buried deeper and deeper in one's file system; sometimes you
buy a larger drive; sometimes stuff migrates to off-site backup - but
you generally don't throw stuff away (though when working on
multi-author documents, one always comes back to how many intermediate
copies to retain "for the record" after the final version goes to print).

In one sense, this ends up looking a lot like managing a git repository
- more and more versions and branches accumulate, and so forth.  And
one starts thinking about storing only change logs.

This is also what motivates my question about how to handle older,
largely inactive processes.  It's one thing to bury a file deeper and
deeper in a file system - and still be able to find and access it (and
these days, search for it).  It's another to think about migrating an
actor from RAM to disk, in a way that retains its ability to respond to
the infrequent message.

The other area I worry about is exponential growth in network traffic
and cpu cycles - assuming that a lot of documents will never completely
"die" - maybe an update will come in once a week, or once a month, or
they'll get searched every once in a while - as the number of processes
increases, the amount of traffic will as well.

> Just keep in mind that in the end, you will almost certainly end up
> doing some form of manual GC activities.  Again, the Heroku gang can
> probably provide a whole bunch of pointers on this?
>

Can you say a bit more about what it is about Heroku that I should be
looking at?  At first glance, it seems like a very different environment
than what we're talking about here (or are you thinking about manual
housekeeping for the virtual environment?).

And.. re. "Bugs (Fred has a great writeup on this somewhere)" - Fred
who?  (Maybe I can find it by googling!)

Thanks Very Much,

Miles

--
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra



"actor database" - architectural strategy question

Michał Ptaszek
In reply to this post by Miles Fidelman-2
On Mon, Feb 17, 2014 at 9:22 PM, Miles Fidelman
<mfidelman>wrote:

> Joe Armstrong wrote:
>
>> This sounds interesting. To start with, I think swapping processes to
>> disk is just an optimization.
>> In theory you could just keep everything in RAM forever. I guess
>> processes could keep their state in dictionaries (so you could roll them
>> back) or ets tables (if you didn't want to roll them back).
>>
>> You would need some form of crash recovery so processes should write some
>> state information
>> to disk at suitable points in the program.
>>
>
> Joe...  can you offer any insight into the dynamics of Erlang, when
> running with large number of processes that have very long persistence?


No - this area has not to my knowledge been investigated. The "use lots of
processes" or "as many processes as necessary" approach has an implicit
assumption that a) the processes are not very large and b) not very long
lived. At the back of my mind I'm thinking of a) as "a few hundred KB
resident size" and b) a few seconds to minutes. I'm *not* thinking MBs and
years. The latter requirements fit into our "telecoms domain" - a few
thousands to tens of thousands of computations living for "the length of a
telephone call", i.e. (max) hours but not years.

Some kind of "getting things out of memory and onto disk when not needed"
layer is needed for your problem.



>  Somehow, it strikes me that 100,000 processes with 1MB of state, each
> running for years at a time, have a different dynamic than 100,000
> processes, each representing a short-lived protocol transaction (say a web
> query).
>

My first comment is, thanks for providing some numbers :-) I keep saying
time and time
again, don't ask questions without numbers. 100K processes with 1MB of
state = 10^11 bytes
so you'd need a really big machine to do this. Assuming say 8GB of memory
and 1MB of state
you'd have an upper limit of 8K processes. This assumes a regular spinning
disk. I guess if you have
a big SSD the story changes.

So you either have to reduce the size of the state, or the number of
processes. The state can (I suppose) be partitioned into a (small) index
and a (larger) content. So I'd keep the index in memory and the content
on disk (or cached).


>
> Coupled with a communications paradigm for identifying a group of
> processes and sending each of them the same message (e.g., 5000 people have
> a copy of a book, send all 5000 of them a set of errata; or send a message
> asking 'who has updates for section 3.2?').
>

Hopefully all 5000 people will not want the errata at the same time


>
> In some sense, the conceptual model is:
> 1. I send you an empty notebook.
> 2. The notebook has an address and a bunch of message handling routines
> 3. I can send a page to the notebook, and the notebook inserts the page.
> 4. You can interact with the notebook - read it, annotate it, edit certain
> sections - if you make updates, the notebook can distribute updates to
> other copies - either through a P2P mechanism or a publish-subscribe
> mechanism.
>
> At a basic level, this maps really well onto the Actor formalism - every
> notebook is an actor, with its own address.  Updates, interactions,
> queries, etc. are simply messages.
>
> Since Erlang is about the only serious implementation of the Actor
> formalism, I'm trying to poke at the edge cases - particularly around
> long-lived actors.  And who better to ask than you :-)
>

It's a very good question. I like questions that poke around at the edges
of what is possible :-)


>
> In passing: Early versions of Smalltalk were actor-like, encapsulating
> state, methods, and process - but process kind of got dropped along the
> way.  By contrast, it strikes me that Erlang focuses on everything being a
> process, and long-term persistence of state has taken a back seat.


Yes - I guess the real solution would be to change the scheduler to swap
processes to disk after they had waited for more than (say) 10 minutes for
a message, and resurrect them when they are sent a message.

The idea that they might be swapped out for years hadn't occurred to me.
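That swap-out/resurrect behaviour can already be approximated at the application layer with the erlang:hibernate/3 BIF - a minimal sketch (the module name, message shapes, and the 10-minute timeout are all illustrative, not a real API):

```erlang
%% A process that hibernates itself after 10 idle minutes and wakes
%% transparently on the next message. hibernate/3 discards the call
%% stack and garbage-collects, so an idle process shrinks to a
%% minimal heap; it resumes in loop/1 when a message arrives.
-module(sleepy).
-export([start/1, loop/1]).

start(State) ->
    spawn(?MODULE, loop, [State]).

loop(State) ->
    receive
        {get, From} ->
            From ! {self(), State},
            loop(State);
        {put, New} ->
            loop(New)
    after 600000 ->                              %% 10 minutes idle
        %% a persistence layer could also write State to disk here
        erlang:hibernate(?MODULE, loop, [State])
    end.
```

This shrinks an idle process's footprint but does not move it out of RAM entirely; for that the state itself has to go to an external store.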


>  I'm trying to probe the edge cases. (I guess another way of looking at
> this is: to what extent is Erlang workable for writing systems based around
> the mobile agent paradigm?)


Pass - at the moment you'd have to implement your own object layer to do
this. I guess you could do this yourself by making send and receive
library routines and making the state of a process explicit rather than
implicit, then sticking everything into a large store (like riak). If you
cache the active processes in memory this might behave well enough.
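A minimal sketch of that explicit-state idea, with dets standing in for riak and an ets table as the in-memory cache of live processes (all module, table, and message names here are made up for illustration):

```erlang
%% Actors keep their authoritative state in a disk store; send/2
%% resurrects a swapped-out actor on demand from its stored state.
-module(actor_store).
-export([init/0, send/2, loop/1]).

init() ->
    {ok, actors} = dets:open_file(actors, [{file, "actors.dets"}]),
    ets:new(live, [named_table, public]).

send(Id, Msg) ->
    Pid = case ets:lookup(live, Id) of
              [{Id, P}] -> P;
              [] ->                        %% not in memory: resurrect
                  [{Id, State}] = dets:lookup(actors, Id),
                  P = spawn(?MODULE, loop, [{Id, State}]),
                  ets:insert(live, {Id, P}),
                  P
          end,
    Pid ! Msg.

loop({Id, State}) ->
    receive
        {update, New} ->
            ok = dets:insert(actors, {Id, New}),  %% checkpoint to disk
            loop({Id, New});
        swap_out ->                               %% drop out of RAM
            ets:delete(live, Id)
    end.
```

A real version would also handle a cached pid that has died (is_process_alive/1) and swap dets for whatever store fits the scale.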


>
>
>
>
>> What I think is a more serious problem is getting data into the system in
>> the first place.
>> I have done some experiments with document commenting and annotation
>> systems and
>> found it very difficult to convert things like word documents into a form
>> that looks half
>> decent in a user interface.
>>
>
> Haven't actually thought a lot about that part of the problem. I'm
> thinking of documents that are more form-like in nature, or at least built
> up from smaller components - so it's not so much going from Word to an
> internal format, as much as starting with XML or JSON (or tuples), building
> up structure, and then adding presentation at the final step.  XML -> Word
> is a lot easier than the reverse :-)
>
> On the other hand, I do have a bunch of applications in mind where parsing
> Word and/or PDF would be very helpful - notably stripping requirements out
> of specifications.  (I can't tell you how much of my time I spend manually
> cutting and pasting from specifications into spreadsheets - for
> requirements tracking and such.)  Again, presentation isn't that much of an
> issue - structural and semantic analysis is.  But, while important, that's
> a separate set of problems - and there are some commercial products that do
> a reasonably good job.
>
>
>> I want to parse Microsoft word files and PDF etc. and display them in a
>> format that is
>> recognisable and not too abhorrent to the user. I also want to allow
>> on-screen manipulation of
>> documents (in a browser) - all of this seems to require a mess of
>> Javascript (in the browser) and a mess of parsing programs in the server.
>>
>> Before we can manipulate documents we must parse them and turn them into
>> a format
>> that can be manipulated. I think this is more difficult than the storing
>> and manipulating documents
>> problem. You'd also need support for full-text indexing, foreign language
>> and multiple character sets and so
>> on. Just a load of horrible messy small problems, but a significant
>> barrier to importing large amounts
>> of content into the system.
>>
>> You'd also need some quality control of the documents as they enter the
>> system (to avoid rubbish in rubbish out), also to maintain the integrity of
>> the documents.
>>
>
> Again, for this problem space, it's more about building up complex
> documents from small pieces, than carving up pre-existing documents.  More
> like the combination of an IDE and a distributed CVS - where fully
> "compiled" documents are the final output.
>
>
>
>> If you have any ideas of now to get large volumes of data into the system
>> from proprietary formats
>> (like ms word) I'd like to hear about it.
>>
>>
> Me too :-)  Though, I go looking for such things every once in a while,
> and:
> - there are quite a few PDF to XML parsers, but mostly commercial ones
>

Suck - then you have to buy them to find out if they are any good


> - there are a few PDF and Word "RFP stripping" products floating around,
> that are smart enough to actually analyze the content of structured
> documents (check out Meridian)
>


> - later versions of Word export XML, albeit poor XML
>

Which sucks



> - there are quite a few document analysis packages floating around,
> including ones that start from OCR images - but they generally focus on
> content (lexical analysis) and ignore structure (it's easier to scan a
> document and extract some measure of what it's about - e.g. for indexing
> purposes; it's a lot harder to find something that will extract the outline
> structure of a document)
>
>
> Cheers,
>
> Miles
>
>
>
> --
> In theory, there is no difference between theory and practice.
> In practice, there is.   .... Yogi Berra
>
>


"actor database" - architectural strategy question

Mahesh Paolini-Subramanya
In reply to this post by Miles Fidelman-2
"How to handle older largely inactive processes"
What we do (did?) was to basically flush/hibernate the process. To generalize, if the process hasn't "done" anything in a while, save its state somewhere, and then hibernate (which has the added benefit of dealing with GC issues :-)
Mind you, there is a wealth of info buried in the phrase "save its state somewhere". This really, really depends on how big, how scalable, how fault-tolerant, how geographic, how... you intend on getting. In short, it can range from "write out a text file" to "pay Riak gobs-o-money and have nodes worldwide". YMMV
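In OTP terms, the "hasn't done anything in a while, save and hibernate" pattern falls out of a plain gen_server, using its idle timeout and the hibernate return value. A sketch - save_state/1 is a placeholder for whichever of those stores you end up choosing:

```erlang
%% gen_server that checkpoints and hibernates after a minute idle.
-module(doc_server).
-behaviour(gen_server).
-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(IDLE, 60000).                      %% 1 minute idle timeout

start_link(Doc) ->
    gen_server:start_link(?MODULE, Doc, []).

init(Doc) ->
    {ok, Doc, ?IDLE}.

handle_call(get, _From, Doc) ->
    {reply, Doc, Doc, ?IDLE};
handle_cast({edit, New}, _Doc) ->
    {noreply, New, ?IDLE}.

%% no message for ?IDLE ms: save state somewhere, then hibernate
handle_info(timeout, Doc) ->
    save_state(Doc),
    {noreply, Doc, hibernate}.

save_state(_Doc) -> ok.                    %% placeholder: file/dets/riak
```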

"Heroku gang" <- My apologies, I meant "those fine fine folks at Heroku who happen to be doing erlang"

cheers

Mahesh Paolini-Subramanya
That tall bald Indian guy..  
Google+ | Blog | Twitter | LinkedIn

On February 17, 2014 at 4:40:22 PM, Miles Fidelman (mfidelman) wrote:

Mahesh,

Mahesh Paolini-Subramanya wrote:
> "Large number of processes with very long persistence"
>
> You *will* run into GC issues here, and of all kinds
> - design artifacts ("hmm, the number of lists that I manipulate
> increases relentlessly...")
> - misunderstanding ("But I passed the binary on, without
> manipulating it at all!")
> - Bugs (Fred has a great writeup on this somewhere)

Very good points - though to a degree they sound more like dependency  
hell than traditional garbage collection to reclaim memory.

Given the document-oriented view, I'm viewing garbage collection more in  
the sense of filing and archiving - the same way that paper documents  
migrate to filerooms then to archives; or email and computer files  
simply get buried deeper and deeper in one's file system; sometimes you  
buy a larger drive; sometimes stuff migrates to off-site backup - but  
you generally don't throw stuff away (though when working on  
multi-author documents, one always comes back to how many intermediate  
copies to retain "for the record" after the final version goes to print).

In one sense, this ends up looking a lot like managing a git repository  
- more and more versions and branches accumulate, and so forth. And  
one starts thinking about storing only change logs.

This is also what motivates my question about how to handle older,  
largely inactive processes. It's one thing to bury a file deeper and  
deeper in a file system - and still be able to find and access it (and  
these days, search for it). It's another to think about migrating an  
actor from RAM to disk, in a way that retains its ability to respond to  
the infrequent message.

The other area I worry about is exponential growth in network traffic  
and cpu cycles - assuming that a lot of documents will never completely  
"die" - maybe an update will come in once a week, or once a month, or
they'll get searched every once in a while - as the number of processes  
increases, the amount of traffic will as well.

> Just keep in mind that in the end, you will almost certainly end up  
> doing some form of manual GC activities. Again, the Heroku gang can  
> probably provide a whole bunch of pointers on this?
>

Can you say a bit more about what it is about Heroku that I should be  
looking at? At first glance, it seems like a very different environment  
than what we're talking about here (or are you thinking about manual  
housekeeping for the virtual environment?).

And.. re. "Bugs (Fred has a great writeup on this somewhere)" - Fred  
who? (Maybe I can find it by googling!)

Thanks Very Much,

Miles

--  
In theory, there is no difference between theory and practice.
In practice, there is. .... Yogi Berra



"actor database" - architectural strategy question

Miles Fidelman-2
Just spent some time reading Fred's blog post (thanks Fred), and
googling "Erlang garbage collection" - interesting stuff, and now I much
better understand the concerns you raise!

A follow-up if I might:

Mahesh Paolini-Subramanya wrote:

> "How to handle older largely inactive processes"
>  What we do (did?) was to basically flush/hibernate the process.  To
> generalize, if the process hasn't "done" anything in a while, save its
> state somewhere, and then hibernate (which has the added benefit of
> dealing with GC issues :-)
> Mind you, there is a wealth of info buried in the phrase "save its
> state somewhere".  This really, /really/ depends on how big, how
> scalable, how fault-tolerant, how geographic, how... you intend on
> getting. In short, it can range from "write out a text file" to "pay
> Riak gobs-o-money and have nodes worldwide". YMMV

Can you say more about this?  Context in which you are
flushing/hibernating processes - and how you're saving state?

Cheers,

Miles

>
> "Heroku gang" <- My apologies, I meant "those fine fine folks at
> Heroku who happen to be doing erlang"
>
> cheers
>
> Mahesh Paolini-Subramanya
> That tall bald Indian guy..
> Google+ <https://plus.google.com/u/0/108074935470209044442/posts>  |
> Blog <http://dieswaytoofast.blogspot.com/>   | Twitter
> <https://twitter.com/dieswaytoofast>  | LinkedIn
> <http://www.linkedin.com/in/dieswaytoofast>
>
> On February 17, 2014 at 4:40:22 PM, Miles Fidelman
> (mfidelman <mailto://mfidelman>) wrote:
>
>> Mahesh,
>>
>> Mahesh Paolini-Subramanya wrote:
>> > "Large number of processes with very long persistence"
>> >
>> > You *will* run into GC issues here, and of all kinds
>> > - design artifacts ("hmm, the number of lists that I manipulate
>> > increases relentlessly...")
>> > - misunderstanding ("But I passed the binary on, without
>> > manipulating it at all!")
>> > - Bugs (Fred has a great writeup on this somewhere)
>>
>> Very good points - though to a degree they sound more like dependency
>> hell than traditional garbage collection to reclaim memory.
>>
>> Given the document-oriented view, I'm viewing garbage collection more in
>> the sense of filing and archiving - the same way that paper documents
>> migrate to filerooms then to archives; or email and computer files
>> simply get buried deeper and deeper in one's file system; sometimes you
>> buy a larger drive; sometimes stuff migrates to off-site backup - but
>> you generally don't throw stuff away (though when working on
>> multi-author documents, one always comes back to how many intermediate
>> copies to retain "for the record" after the final version goes to print).
>>
>> In one sense, this ends up looking a lot like managing a git repository
>> - more and more versions and branches accumulate, and so forth. And
>> one starts thinking about storing only change logs.
>>
>> This is also what motivates my question about how to handle older,
>> largely inactive processes. It's one thing to bury a file deeper and
>> deeper in a file system - and still be able to find and access it (and
>> these days, search for it). It's another to think about migrating an
>> actor from RAM to disk, in a way that retains its ability to respond to
>> the infrequent message.
>>
>> The other area I worry about is exponential growth in network traffic
>> and cpu cycles - assuming that a lot of documents will never completely
>> "die" - maybe an update will come in once a week, or once a month, or
>> they'll get searched every once in a while - as the number of processes
>> increases, the amount of traffic will as well.
>>
>> > Just keep in mind that in the end, you will almost certainly end up
>> > doing some form of manual GC activities. Again, the Heroku gang can
>> > probably provide a whole bunch of pointers on this?
>> >
>>
>> Can you say a bit more about what it is about Heroku that I should be
>> looking at? At first glance, it seems like a very different environment
>> than what we're talking about here (or are you thinking about manual
>> housekeeping for the virtual environment?).
>>
>> And.. re. "Bugs (Fred has a great writeup on this somewhere)" - Fred
>> who? (Maybe I can find it by googling!)
>>
>> Thanks Very Much,
>>
>> Miles
>>
>> --
>> In theory, there is no difference between theory and practice.
>> In practice, there is. .... Yogi Berra
>>


--
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra



"actor database" - architectural strategy question

Motiejus Jakštys-2
In reply to this post by Miles Fidelman-2
On Mon, Feb 17, 2014 at 7:04 PM, Miles Fidelman <mfidelman>
wrote:
>
> If I wanted to model this as a standard database, or serializing state
> into
> a traditional database, I wouldn't be asking the questions I asked. Can
> anybody talk to the questions I actually asked, about:
> - handling large numbers of actors that might persist for years, or
> decades
> (where actor = Erlang-style process)
> - backup up/restoring state of long-running actors that might crash
> - multi-cast messaging among actors

Hi,

Some time ago I was part of a team which created software to manage phone
number migration between mobile operators. Say you want to change your cell
phone provider (mandatory in EU and in many other countries). We were the
entity responsible for that process.

One portability request is one process. At any time we could have had up to
1M processes (practically it was much lower, but we used this number when
designing the system). A "portability process" is a finite state machine
with communication in SOAP between two or three parties with many internal
state changes.

A single process could last from a few hours up to a few months (median ~3-4
days), each state up to 10-100KB of uncompressed text (mean ~15KB
uncompressed).

Having Erlang processes allowed very nice things like straightforward
programming of state transitions during timeouts.

Strict consistency requirements meant we had checkpoints in a key-value
store for every operation for every process, which was managed globally.
From those checkpoints it was possible to re-create state by replaying all
actions.

We did not really manage to fully implement a proper addressing mechanism
for non-volatile message sending. We invented our own PIDs which had some
sort of address / node ownership information. The mechanism was complex and
imperfect, nothing really to learn from. AMQP might be a good candidate
though.

Note that some of the details above are not exactly true (esp. numbers),
because I can't remember all the details.

A few remarks:
1. Do *not* store full state after you change it. Implement a diff
mechanism on your abstract state tree (it's strictly defined, right?), test
it using PropEr and use that. If you require fast recovery in case of
crash, checkpoint is ok, but never drop the old state. You might dispute
the state transition after months, go fix the bug and want to re-run a
particular process transitions again next year... Ugh.
2. Long-lived processes (weeks+) are perfectly fine for Erlang VM. Just
make sure to hibernate them after some minutes of inactivity. You can
easily have hundreds of thousands, consume basically no CPU and just enough
memory to keep the internal state.
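The diff idea in point 1 can be sketched for map-shaped state (illustrative only - note it records changed and added keys but not deletions, which a real log would have to handle):

```erlang
%% Store diffs instead of full state; replay/2 rebuilds any
%% historical state from the initial state plus the diff log.
-module(state_log).
-export([diff/2, apply_diff/2, replay/2]).

%% keys whose values changed or were added between Old and New
diff(Old, New) ->
    maps:filter(fun(K, V) -> maps:get(K, Old, undefined) =/= V end, New).

apply_diff(State, Diff) ->
    maps:merge(State, Diff).

replay(Initial, Diffs) ->
    lists:foldl(fun(D, S) -> apply_diff(S, D) end, Initial, Diffs).
```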

Regards,
Motiejus


"actor database" - architectural strategy question

Miles Fidelman-2
In reply to this post by Michał Ptaszek
[By the way folks - all the other threads going on be damned - this is a
great community.  Thank you all for the rapid and useful input to what
is, as yet, a vaporous system concept!]


Hi Joe,

First off, thanks for the response!

Following-up, inline:


Joe Armstrong wrote:

>
>
>
> On Mon, Feb 17, 2014 at 9:22 PM, Miles Fidelman
> <mfidelman <mailto:mfidelman>> wrote:
>
>
>     Joe...  can you offer any insight into the dynamics of Erlang,
>     when running with large number of processes that have very long
>     persistence?
>
>
> No - this area has not to my knowledge been investigated. The "use
> lots of processes" or "as many processes as necessary" has an implicit
> assumption that  a) the processes are not very large and
> b) not very long lived. At the back of my mind I'm thinking of a) as
> "a few hundred KB resident size" and
> b) a few seconds to minutes. I'm *not* thinking MBs and years. The
> latter requirements fit into our
> "telecoms domain" - a few thousands to tens of thousands of
> computations living for "the length of a telephone call" ie (max)
> hours but not years.
>
> Some kind of "getting things out of memory and onto disk when not
> needed" layer is needed for your problem,

Ok.  After reading what others have said about garbage collection, this
is clearly issue number one that I'll need to address.

At first glance, it strikes me that the hibernate BIF does at least part
of what's needed - any thoughts/suggestions as to whether it might make
sense to approach this by extending hibernate, vs. something at the
application layer?  And, if it makes sense to play with the BIF, any
quick pointers to where I might find detailed documentation on how it's
implemented?

>
>      Somehow, it strikes me that 100,000 processes with 1MB of state,
>     each running for years at a time, have a different dynamic than
>     100,000 processes, each representing a short-lived protocol
>     transaction (say a web query).
>
>
> My first comment is, thanks for providing some numbers :-) I keep
> saying time and time
> again, don't ask questions without numbers. 100K processes with 1MB of
> state = 10^11 bytes
> so you'd need a really big machine to do this. Assuming say 8GB of
> memory and 1MB of state
> you'd have an upper limit of 8K processes. This assumes a regular
> spinning disk. I guess if you have
> a big SSD the story changes.
>
> So you either have to reduce the size of the state, or the number of
> processes. The state can (I suppose) be partitioned into a (small)
> index and a (larger) content. So I'd keep the index in memory and the
> content
> on disk (or cached).

Which also brings us back to keeping most of the documents in some kind
of hibernation, stored on disk, but ready to wake up if called on.

>
>     Coupled with a communications paradigm for identifying a group of
>     processes and sending each of them the same message (e.g., 5000
>     people have a copy of a book, send all 5000 of them a set of
>     errata; or send a message asking 'who has updates for section 3.2).
>
>
> Hopefully all 5000 people will not want the errata at the same time

Here's where I think pub-sub and replication.

>
>
>     In some sense, the conceptual model is:
>     1. I send you an empty notebook.
>     2. The notebook has an address and a bunch of message handling
>     routines
>     3. I can send a page to the notebook, and the notebook inserts the
>     page.
>     4. You can interact with the notebook - read it, annotate it, edit
>     certain sections - if you make updates, the notebook can
>     distribute updates to other copies - either through a P2P
>     mechanism or a publish-subscribe mechanism.
>
>     At a basic level, this maps really well onto the Actor formalism -
>     every notebook is an actor, with its own address.  Updates,
>     interactions, queries, etc. are simply messages.
>
>     Since Erlang is about the only serious implementation of the Actor
>     formalism, I'm trying to poke at the edge cases - particularly
>     around long-lived actors.  And who better to ask than you :-)
>
>
> It's a very good question. I like questions that poke around at the
> edges of what is possible :-)
>
>
>     In passing: Early versions of Smalltalk were actor-like,
>     encapsulating state, methods, and process - but process kind of
>     got dropped along the way.  By contrast, it strikes me that Erlang
>     focuses on everything being a process, and long-term persistence
>     of state has taken a back seat.
>
>
> Yes - I guess the real solution would be to change the scheduler to
> swap processes to disk after they had waited for more than (say) 10
> minutes for a message, and resurrect them when they are sent a message.


Any thoughts on how to do this - perhaps in combination with extending
the hibernate BIF?


Cheers,

Miles


------ nothing new below here --------

>
> The idea that they might be swapped out for years hadn't occurred to me.
>
>      I'm trying to probe the edge cases. (I guess another way of
>     looking at this is: to what extent is Erlang workable for writing
>     systems based around the mobile agent paradigm?)
>
>
> Pass - at the moment you'd have to implement your own object layer to
> do this.
> I guess you could do this yourself by making send and receive library
> routines and
> making the state of a process explicit rather than implicit, then
> sticking everything into
> a large store (like riak). If you cache the active processes in memory
> this might behave
> well enough.
>
>
>
>
>
>
>         What I think is a more serious problem is getting data into
>         the system in the first place.
>         I have done some experiments with document commenting and
>         annotation systems and
>         found it very difficult to convert things like word documents
>         into a form that looks half
>         decent in a user interface.
>
>
>     Haven't actually thought a lot about that part of the problem. I'm
>     thinking of documents that are more form-like in nature, or at
>     least built up from smaller components - so it's not so much going
>     from Word to an internal format, as much as starting with XML or
>     JSON (or tuples), building up structure, and then adding
>     presentation at the final step.  XML -> Word is a lot easier than
>     the reverse :-)
>
>     On the other hand, I do have a bunch of applications in mind where
>     parsing Word and/or PDF would be very helpful - notably stripping
>     requirements out of specifications.  (I can't tell you how much of
>     my time I spend manually cutting and pasting from specifications
>     into spreadsheets - for requirements tracking and such.)  Again,
>     presentation isn't that much of an issue - structural and semantic
>     analysis is.  But, while important, that's a separate set of
>     problems - and there are some commercial products that do a
>     reasonably good job.
>
>
>         I want to parse Microsoft word files and PDF etc. and display
>         them in a format that is
>         recognisable and not too abhorrent to the user. I also want to
>         allow on-screen manipulation of
>         documents (in a browser) - all of this seems to require a mess
>         of Javascript (in the browser) and a mess of parsing programs
>         in the server.
>
>         Before we can manipulate documents we must parse them and turn
>         them into a format
>         that can be manipulated. I think this is more difficult than
>         the storing and manipulating documents
>         problem. You'd also need support for full-text indexing,
>         foreign language and multiple character sets and so
>         on. Just a load of horrible messy small problems, but a
>         significant barrier to importing large amounts
>         of content into the system.
>
>         You'd also need some quality control of the documents as they
>         enter the system (to avoid rubbish in rubbish out), also to
>         maintain the integrity of the documents.
>
>
>     Again, for this problem space, it's more about building up complex
>     documents from small pieces, than carving up pre-existing
>     documents.  More like the combination of an IDE and a distributed
>     CVS - where fully "compiled" documents are the final output.
>
>
>
>         If you have any ideas of how to get large volumes of data into
>         the system from proprietary formats
>         (like ms word) I'd like to hear about it.
>
>
>     Me too :-)  Though, I go looking for such things every once in a
>     while, and:
>     - there are quite a few PDF to XML parsers, but mostly commercial ones
>
>
> Suck - then you have to buy them to find out if they are any good
>
>     - there are a few PDF and Word "RFP stripping" products floating
>     around, that are smart enough to actually analyze the content of
>     structured documents (check out Meridian)
>
>     - later versions of Word export XML, albeit poor XML
>
>
> Which sucks
>
>     - there are quite a few document analysis packages floating
>     around, including ones that start from OCR images - but they
>     generally focus on content (lexical analysis) and ignore structure
>     (it's easier to scan a document and extract some measure of what
>     it's about - e.g. for indexing purposes; it's a lot harder to find
>     something that will extract the outline structure of a document)
>
>
>     Cheers,
>
>     Miles
>
>
>
>     --
>     In theory, there is no difference between theory and practice.
>     In practice, there is.   .... Yogi Berra
>
>     _______________________________________________
>     erlang-questions mailing list
>     erlang-questions
>     http://erlang.org/mailman/listinfo/erlang-questions
>
>


--
In theory, there is no difference between theory and practice.
In practice, there is.   .... Yogi Berra

