[Architecture] Revamping BAM 2.0 Data Model & Analytics

Tharindu Mathew tharindu at wso2.com
Wed Jan 25 06:07:00 EST 2012


I took a look at Hive + Cassandra and it isn't straightforward... there
will be some work...

OTOH, if we want to do something like...

get the value of field "foo", add 5, and put it back... is there a way to do
this in SQL?

We already do it with BAL (BAM Analytics Language)... So is it a good idea
to support more than one language?
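
To make the question concrete, here is a minimal sketch of the "get foo, add 5,
put it back" operation in standard SQL, run against an in-memory SQLite
database from Python. The table name `events` and the `id` column are
hypothetical; only the field name "foo" comes from the question above. This
shows what plain SQL can express, not what HiveQL over Cassandra supports;
whether Hive can write the value back in place would still need to be checked.

```python
import sqlite3

# Hypothetical table layout; only the field name "foo" is from the thread.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id TEXT PRIMARY KEY, foo INTEGER)")
conn.execute("INSERT INTO events VALUES ('e1', 10)")

# Read-modify-write expressed as a single SQL statement.
conn.execute("UPDATE events SET foo = foo + 5 WHERE id = ?", ("e1",))
conn.commit()


def read_foo(event_id):
    """Return the stored value of foo for one event."""
    row = conn.execute(
        "SELECT foo FROM events WHERE id = ?", (event_id,)
    ).fetchone()
    return row[0]


# Read-only variant: compute foo + 5 in the query without storing it.
projected = conn.execute("SELECT id, foo + 5 FROM events").fetchall()
```

The read-only variant is a plain SELECT expression, so the open question is
mostly about the write-back step rather than the arithmetic.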

On Wed, Jan 25, 2012 at 4:17 PM, Srinath Perera <srinath at wso2.com> wrote:

> +1 on all counts --Srinath
>
> On Mon, Jan 23, 2012 at 5:21 PM, Tharindu Mathew <tharindu at wso2.com> wrote:
> > These are some thoughts based on a brainstorming session we had after
> > looking at the CEP model of defining streams.
> >
> > Right now, BAM 2.0 defines a very generic model, which allows you to
> > perform any type of analytics, but the data model and analytics
> > structure fall short in terms of usability and design.
> >
> > There are three parts to this discussion:
> >
> > Type definition
> > Streams & Data model
> > Analytics
> >
> > 1. Type Definition
> >
> > With the unification of the data agents so that they are also compatible
> > with Siddhi, we have decided to send the type definition request as part
> > of the session initialization mechanism. This allows a faster data rate,
> > and it also benefits analytics, to which type safety can now be applied.
> >
> > A streamId will now also be sent as part of each event, and we will make
> > use of the concept of streams as defined in the next section.
> >
> > Ex: service_data : { int, int, int, float, float, float }
> >
> >
> > 2. Streams & Data model
> >
> > Right now, all events in BAM 2.0 are stored in 3 base tables ("table" is
> > used here instead of "column family" for the benefit of RDBMS users). So
> > 2 completely different event sources, such as order data from a retail
> > store and service level statistics, will be stored in the same table.
> >
> > ex: EVENT : {
> >       retail_uuid_1 : { retail_event_key1 : retail_event_value1 ...... }
> >       service_stats_uuid_1 : { service_stats_key1 : service_stats_value_1 .... }
> >       retail_uuid_2 : { retail_event_key1 : retail_event_value1 ...... }
> >       service_stats_uuid_2 : { service_stats_key1 : service_stats_value_1 .... }
> > }
> >
> > We would like to move from storing all this data in one table to storing
> > it in one table per stream. A stream is defined as a set of events of a
> > similar type.
> >
> > Extending the same example, the data would look as follows, with the
> > stream name used as the table name.
> >
> > Retail_Store : {
> >     retail_uuid_1 : { retail_event_key1 : retail_event_value1 ...... }
> >     retail_uuid_2 : { retail_event_key1 : retail_event_value2 ...... }
> > }
> >
> > Service_Stats : {
> >     service_stats_uuid_1 : { service_stats_key1 : service_stats_value_1 .... }
> >     service_stats_uuid_2 : { service_stats_key1 : service_stats_value_2 .... }
> > }
> >
> > We believe this will be cleaner and easier for users, as it maps more
> > closely to a table. With type data associated with each stream, it will
> > also be easier to notify users of their errors and for them to construct
> > analytics. We also do not sacrifice the flexibility we had earlier,
> > because every new stream will be a new table, allowing for KPI
> > definition.
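
A rough sketch of the proposal above: events carry a stream id, each stream has
a type definition, and each stream gets its own table. Everything here is
illustrative. `StreamStore`, `define_stream`, `publish`, and the
`retail_store` definition are hypothetical names and values, and a plain
in-memory dict stands in for Cassandra. Only the `service_data` definition
comes from the example in section 1.

```python
from collections import defaultdict

# Assumed mapping from type names in a definition to Python types.
TYPE_MAP = {"int": int, "float": float, "string": str}


class StreamStore:
    """Hypothetical in-memory stand-in for one table per stream."""

    def __init__(self):
        self.definitions = {}            # stream name -> tuple of types
        self.tables = defaultdict(dict)  # stream name -> {event id: values}

    def define_stream(self, name, type_names):
        """Register the ordered field types for a stream."""
        self.definitions[name] = tuple(TYPE_MAP[t] for t in type_names)

    def publish(self, stream_id, event_id, values):
        """Type-check an event and store it in its stream's table."""
        types = self.definitions.get(stream_id)
        if types is None:
            raise KeyError(f"unknown stream: {stream_id}")
        if len(values) != len(types):
            raise TypeError(
                f"{stream_id} expects {len(types)} values, got {len(values)}"
            )
        for position, (value, expected) in enumerate(zip(values, types)):
            if type(value) is not expected:
                raise TypeError(
                    f"{stream_id}[{position}] expects {expected.__name__}"
                )
        self.tables[stream_id][event_id] = tuple(values)


store = StreamStore()
store.define_stream(
    "service_data", ["int", "int", "int", "float", "float", "float"]
)
store.define_stream("retail_store", ["string", "int"])  # assumed definition
store.publish("service_data", "service_stats_uuid_1", [12, 0, 1, 0.5, 0.2, 0.9])
store.publish("retail_store", "retail_uuid_1", ["order-1", 3])
```

With this shape, a malformed event is rejected at publish time, and retail and
service events never share a table.
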
> >
> > We can allow explicit stream definitions as well as construct dynamic
> > streams based on received events. Dynamic stream definitions should be
> > the preferred option, IMO. Otherwise, it might be too painful to start
> > collecting data for a new event stream.
> >
> > 3. Analytics
> >
> > The ultimate goal of both type definitions and streams would be to make
> > analytics easier to write.
> >
> > An analytics query will now be associated with a stream, so that each
> > query no longer has to run through all the events that come into BAM.
> > Also, with a query associated with a stream, the type information is
> > readily available, allowing us to warn about or disallow incorrect
> > queries.
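
As an illustration of the point above, the sketch below validates a simple
aggregate query against a stream's type definition before running it, and only
scans that stream's table. The function names, the positional `field_index`,
the supported operations, and the sample rows are all assumptions made for the
example; they are not BAL or HiveQL.

```python
NUMERIC_TYPES = {"int", "float"}
SUPPORTED_OPS = {"sum", "avg", "count"}


def validate_aggregate(stream_defs, stream, field_index, op):
    """Reject a query that cannot be correct for the stream's types."""
    if op not in SUPPORTED_OPS:
        raise ValueError(f"unsupported operation: {op}")
    if stream not in stream_defs:
        raise ValueError(f"unknown stream: {stream}")
    types = stream_defs[stream]
    if not 0 <= field_index < len(types):
        raise ValueError(f"{stream} has no field {field_index}")
    if op in ("sum", "avg") and types[field_index] not in NUMERIC_TYPES:
        raise ValueError(f"cannot {op} a {types[field_index]} field")


def run_aggregate(stream_defs, tables, stream, field_index, op):
    """Validate first, then scan only the table of the named stream."""
    validate_aggregate(stream_defs, stream, field_index, op)
    values = [row[field_index] for row in tables.get(stream, {}).values()]
    if op == "count":
        return len(values)
    if not values:
        return None  # nothing to sum or average yet
    total = sum(values)
    return total if op == "sum" else total / len(values)


# Sample definitions and rows, invented for the example.
stream_defs = {
    "service_data": ["int", "int", "int", "float", "float", "float"],
    "retail_store": ["string", "int"],
}
tables = {
    "service_data": {
        "service_stats_uuid_1": (3, 0, 1, 0.5, 0.2, 0.9),
        "service_stats_uuid_2": (5, 1, 0, 0.7, 0.1, 0.4),
    },
    "retail_store": {"retail_uuid_1": ("order-1", 3)},
}

total_requests = run_aggregate(stream_defs, tables, "service_data", 0, "sum")
```

A query such as a sum over the string field of `retail_store` fails in
`validate_aggregate` before any data is read.
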
> >
> > Orthogonally, a more standard data structure allows us to use a more
> > SQL-like language. Since BAM currently has its own language, the plan is
> > to deprecate it and move to a SQL-based language. A major reason for
> > this is to avoid the complications of a new language, such as
> > unfamiliarity. Also, most people in the analytics domain are familiar
> > with SQL-based languages. As scalable analytics is a primary requirement
> > of BAM, the natural choice is HiveQL. We hope to evaluate the
> > feasibility of Hive and make it the default mode for running analytics.
> > The default mode will run in Hadoop local mode, and will be packed to
> > work OOTB using the work done for Carbon-Hadoop integration.
> >
> > All the points above will move towards a cleaner data model and easier
> > analytics.
> >
> > Any feedback on the above thoughts is appreciated.
> >
> > --
> > Regards,
> >
> > Tharindu
> >
> > blog: http://mackiemathew.com/
> > M: +94777759908
> >
> >
> > _______________________________________________
> > Architecture mailing list
> > Architecture at wso2.org
> > https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
> >
>
>
>
> --
> ============================
> Srinath Perera, Ph.D.
>    http://www.cs.indiana.edu/~hperera/
>    http://srinathsview.blogspot.com/
>



-- 
Regards,

Tharindu

blog: http://mackiemathew.com/
M: +94777759908