Technical Story Telling: Oracle Data Cloud

 

I had gotten used to regular reorganizations (reorgs) as part of my Microsoft stint.  Fifteen months into Oracle, just as I was wondering why there were no reorgs at Oracle Corporation, within a week I got a mail, in May 2014, saying we had been reorganized into a new division named Oracle Data Cloud (ODC), headed by Omar Tawakol, who led Oracle's latest acquisition, Blue Kai.  Telepathy at work, I suppose :-).

Omar had a clear business vision of how ODC should evolve, and he worked tirelessly to acquire another company, Datalogix, which augmented the product scope very well.  With the three companies, namely Blue Kai, Datalogix, and Collective Intellect, Oracle Data Cloud has a solid product scope and a huge market share, and to this day it occupies the leaders' quadrant.

Vijay Amrit Agrawal was recruited back into Oracle after just a year away and was tasked with setting up a team in India under ODC, the ODC India team.  Here is a technical story that I have used while recruiting developers into Oracle Data Cloud.  Below is the long version; in practice I shorten the story appropriately based on the candidate's experience and exposure.

 

Story – Blue Kai:

Google was a well-known brand about 15 years ago.  Everyone who wanted to research anything went straight to google.com and searched.  It is this behavior that made Google a leader in the digital ad domain.  As people visited sites via Google search results, Google knew exactly what the user was looking for and served high-quality advertisements in the web sites' ad panels.  Google was respected for its non-intrusive, simple text ads because their quality was so good.

Around 5 years back, FlipKart, Amazon, and SnapDeal became well-known brands.  Users who wanted to buy a product went directly to these sites and searched, leaving Google (or any other search engine) clueless about what the user was looking for.  Slowly, Google's ad quality declined as more and more sites became well known.  Google also became increasingly clueless as browsers became more advanced and helped users hit sites directly, because the sites were already in their search history or bookmarks.  Google realized this soon and came up with a new browser, Google Chrome, so that it could capture every site visited directly from the browser without a Google search.  While this worked for logged-in users, it did not work for others.  Many sites feared sharing their search data with Google, as they could be eaten by Google the giant.

Blue Kai realized that this lack of trust in the digital world was a huge opportunity to tap into.  Blue Kai came up with a smart, win-win data-sharing proposition and silently worked with many online properties (sites) to get them to buy into this business model.  Before we get into that business model, let's ask a question so that we can appreciate the unique business model that Blue Kai offered and that so many bought into.

Would Amazon, FlipKart, and Google share their search data with each other?

Your answer to the above question would most probably be "No".  But with Blue Kai in the game, the answer is "Yes"!  How, you may ask?

The proposition: you share your data with me, and I will give you everyone else's data.  In this data sharing, the source site of the data is not maintained.  That is, when I share others' data with you, I cannot tell you where I got it from.  Similarly, when I share your data with the world, I cannot tell them where it came from.  But I can tell you which user's machine the search data is for.

Soon, Blue Kai became the online data-sharing hub for many online sites (and guess what that "many" means in numbers: millions of sites!).  OK, how does it work?

You go to Amazon.in and search for a TV.  You open another tab and hit FlipKart.  How would it be if FlipKart displayed a set of TV offers?  You open another tab and browse a technical blog on Apache Kafka, and in a side panel Google displays TV ads.  How would that be?

That's the power that Blue Kai brings to all these sites.  It is not some offline data sharing, but real-time, web-scale sharing of data that is only microseconds away.  Google has to process only its own search data, and Bing has to process only its own search data, but we at Blue Kai have to process every search on every internet site in real time and share that data with everyone else, within microseconds, when asked.

Thanks to Blue Kai, Google's ad quality problem has been solved!  Once we have so much data, we surely know how to make money out of it.  We charge the sites in this data-sharing model based on different business scenarios.

 

Story – Datalogix:

Chief Marketing Officers (CMOs) around the world started doubting the whole digital advertisement spend and its yield.  In the case of TV ads, the user's attention is guaranteed as long as the user stays on the channel.  Whereas with non-intrusive ads in the online world, it is not very clear to a CMO whether the ad has excited the users.  Google countered this with a pay-per-click pricing model, where the advertiser pays only when the ad is clicked and not when it is merely displayed.  While this proves, based on click stats, that the user did see the ad, it is still not clear whether there is any increase in sales, especially if the advertisement relates to the offline world, for example an ad like "Reliance Digital offers 35% discount on all Sony TVs", where it is very hard to assess the offline store sales of Sony TVs.  CMOs started asking: why should we put in so much money if there is no increase in sales?  If there is indeed an increase in sales, what is the volume?  Is it worth it?  How can one compute the net increase in sales as a result of a digital ad campaign?

Welcome to Datalogix; we at Datalogix solve this problem.  Datalogix is an interesting company in that it acts as a bridge between the online and offline worlds.  Datalogix buys offline stores' sales data in an aggregated fashion, without any identity of the buyer.  Here are example records from a Reliance Digital store:

Store Location, Company Name, Product Name, Model, Week Number, Sales Volume

Kondapur, Sony, LED TV, Bravia 1234, 1, 10

Kondapur, Sony, Smart TV, Bravia 5678, 1, 5

Which store would not be happy to make money by selling such data, which does not reveal any buyer identity?  Datalogix got this offline sales data from every offline store possible.

Now, it started pitching to advertisement platforms such as Google, Bing, etc.: you share your advertisement footprint with me, and I will prove (or disprove) whether your advertisement footprint translates to online plus offline sales.  Of course, advertisement platforms would be happy to have it proved that digital advertisement works, so that they can take this proof to CMOs.

When a user browses any site where advertisements are displayed (assuming the ads are served by Google), Google would share the footprint of the ad, such as:

Location of Browsing Computer, Advertiser Company, Advertised Product, Advertisement Served Time

Kondapur, Reliance Digital, LED TV, 2016-01-01 01:01

Kondapur, Reliance Digital, LED TV, 2016-01-01 02:02

Datalogix does a big-data join between the offline sales data and the online advertisement footprint and proves (or disproves) whether there is any correlation between online advertisement footprint volume and sales volume (by region).  This big-data join, as you can see, involves aggregating data by region, product, week, etc.
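
To make the shape of that join concrete, here is a small, purely illustrative C sketch (my own toy example, not Datalogix code; a real system is a distributed big-data pipeline, not two in-memory arrays).  It counts ad impressions per (region, product, week) key and pairs each count with the reported sales volume for the same key:

#include <stdio.h>
#include <string.h>

/* Toy records; field names and sample values are illustrative only */
struct AdFootprint { const char *region, *product; int week; };
struct SalesRecord { const char *region, *product; int week; int volume; };

int main(void) {
    struct AdFootprint ads[] = {
        { "Kondapur", "LED TV",   1 },
        { "Kondapur", "LED TV",   1 },
        { "Kondapur", "Smart TV", 1 },
    };
    struct SalesRecord sales[] = {
        { "Kondapur", "LED TV",   1, 10 },
        { "Kondapur", "Smart TV", 1,  5 },
    };

    /* For each sales key, aggregate the matching ad impressions (the "join") */
    for (size_t s = 0; s < sizeof sales / sizeof sales[0]; s++) {
        int impressions = 0;
        for (size_t a = 0; a < sizeof ads / sizeof ads[0]; a++) {
            if (strcmp(ads[a].region,  sales[s].region)  == 0 &&
                strcmp(ads[a].product, sales[s].product) == 0 &&
                ads[a].week == sales[s].week)
                impressions++;
        }
        printf("%s, %s, week %d: %d ad impressions vs. %d units sold\n",
               sales[s].region, sales[s].product, sales[s].week,
               impressions, sales[s].volume);
    }
    return 0;
}

At web scale the same aggregation runs over billions of rows, but the correlation question being asked is exactly this one.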

Advertisers can verify the Datalogix findings by talking to offline stores in the areas that Datalogix claims have seen an increase in sales volume.  Being an independent company not associated with any advertisement platform, Datalogix is well regarded for its impartiality, and the advertiser's ability to verify the findings makes Datalogix trustworthy.

We at Oracle Data Cloud are smart about making money.  We make money pre-advertisement by sharing the ad targeting data (Blue Kai), and we make money post-advertisement by helping prove the sales yield, and hence the advertisement quality (Datalogix).

 

Story – Collective Intellect:

Blue Kai and Datalogix work very well as long as things are being searched for.  However, that whole system fails if there is no searching involved, just a textual discussion in the online world, be it on discussion boards, forums, social sites, etc.  That gap is filled by Collective Intellect, which I have covered in my previous blog post here.

 

(Disclaimer: Brands used in this post are just examples.)

Technical Story Telling: Collective Intellect

 

Collective Intellect is a Boulder, CO, US start-up that Oracle acquired in June 2012.  I quit Microsoft and joined Oracle to work for this team in January 2013.  Here is a technical story that I have used while recruiting developers into the team.

Story:

Bhargavi wanted to buy a television (TV), so she did her research: searching, reading articles, browsing different sites, and comparing features and technologies.  Being a money-conscious person, she compared prices across different e-commerce sites such as FlipKart.com, Amazon.in, SnapDeal.com, etc.  Being a thorough researcher, she also researched which e-commerce sites are better, which sellers are better, etc.  Finally, she settled on a TV model, the Samsung Smart LED TV 1357, picked an e-commerce site and a seller, and purchased it.  Everyone around her appreciated her decision.

Bhavani also wants to buy a TV, but she is not a researcher and trusts friends and family.  She reached out to her friends on Facebook about her desire to buy a TV: "Hi friends, I want to buy a TV.  Any recommendations?".  Bhargavi, a friend of Bhavani, saw the post and responded with the details of her recent research and a recommendation for the Samsung Smart LED TV 1357.

Saraswathi, another friend of Bhavani, saw this conversation, chimed in, and described her ordeal with the TV she had recently purchased: "Hi Bhavani – I recently bought the Sony Smart LED TV 2468 and I strongly recommend that you *not* go for this model".

 

Problem(s):

Businesses around the world want to improve their sales.  To improve their sales, they need to improve their products.  To improve the products, businesses need feedback on them: what's good and what's bad about the product.  Businesses also want to reduce the damage caused by the "bad" side of the product (of the previous version).  In the pre-internet age, learning what was not going well with a product was done through feedback forms, etc.  In the early internet age, businesses sent e-mails some time after the purchase (typically a month or two) asking the customer to fill in an online form.  The problem with this feedback is that it reflects only the instant the form was filled in, which may vary with the ongoing usage of the product and its performance.  For example, a customer may be very happy after 1 month of usage, but after 6 months the same customer might be completely upset about the purchase because of other issues that have cropped up.  Knowing these issues and addressing them is very important for a business to succeed.  Reaching out periodically over e-mail may not work out, as it may be regarded as overreaching or even spamming!  Unhappy customers are vocal, and not only has the business lost them as customers, but because of their vocal nature, potential future customers also refrain from buying its products.

In the current internet age, customers share their feedback in a variety of ways: blogs, discussion boards and forums, review sites, ranking sites, social sites, etc., at the very instant they are unhappy.  If a business can address the unhappy customer at the right time, the damage can be controlled by a great margin.  Imagine United Airlines had a way to get notified about the video posted by their customer, whose guitar was mishandled and who faced staff indifference towards the issue, before it went viral.  How powerful would that be?

The problem with the online world is that there is too much data for any company to handle.  A lot of that data is not relevant to a business.  Extracting the relevant data (signal) out of so much data (noise) is a software problem and not a business problem (unless the business is also about software).

Adding to this, issues with a product may not be global but local, perhaps due to the local environment, the manufacturing site, or something else.  Being able to aggregate and drill into this information with ease would be a huge plus for any business.

 

Solution:

Collective Intellect collects textual data from all the online media such as blogs, news, forums, boards, review sites, social sites, etc., and analyzes all of that data (mostly noise) to extract the important information (signal).  In this process, the product uses Natural Language Processing (NLP) algorithms to drop most of the data, as the online world is full of conversations that are not related to businesses.

In the above story, the discussion is around the entity "TV".  Bhavani's post contains "Purchase" language, Saraswathi's response contains "Support" language, and Bhargavi's response contains "Promotion" language.  It is these meanings that Collective Intellect identifies, and it then shares the information with the TV companies if they are our customers.  Customers can then route this information to the appropriate departments within their company, with "Support" language events becoming a support ticket, "Purchase" language becoming a sales lead, and "Promotion" language feeding a loyalty program.
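
As a very loose illustration (the phrases below are made up, and real products use proper NLP models, not keyword matching), a toy classifier for the three kinds of language might look like this:

#include <stdio.h>
#include <string.h>

/* Toy keyword-based classifier; real NLP pipelines are far more sophisticated */
static const char *classify(const char *post) {
    if (strstr(post, "want to buy") || strstr(post, "Any recommendations"))
        return "Purchase";      /* buying intent: route to Sales */
    if (strstr(post, "not go for") || strstr(post, "does not work"))
        return "Support";       /* complaint: route to Support */
    if (strstr(post, "I recommend"))
        return "Promotion";     /* endorsement: route to Loyalty */
    return "Irrelevant";        /* most of the online firehose is noise */
}

int main(void) {
    const char *posts[] = {
        "Hi friends, I want to buy a TV. Any recommendations?",
        "I recently bought this TV and I recommend that you not go for this model",
        "I recommend the Samsung Smart LED TV 1357",
    };
    for (int i = 0; i < 3; i++)
        printf("%s -> %s\n", posts[i], classify(posts[i]));
    return 0;
}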

To give you an idea of this working in the real world, have you observed something like this on Facebook?

Laxmi: I am completely fed-up with my Vodafone connection.  While my SIM Card works well, my wife’s SIM Card does not work in the same house.

Vodafone Customer Care:  Dear Laxmi, We are really sorry for the inconvenience caused.  Can you please share more details of the problem such as which SIM Card is working and which is not,  the locality, etc. so that we can dig deep.

Laxmi: Here are the details.  Working SIM: 1234567890, Failing SIM: 9876543210, Location: Hyderabad

Vodafone Customer Care:  Based on our backend analysis, we think that SIM Card is corrupted.  We dispatched a new SIM Card to you, please try and let us know. 

Laxmi: I received the SIM Card and it works well.  Thanks for resolving the issue.

Have you ever wondered how Vodafone Customer Care knew that Laxmi, of all Facebook users, had posted about them, and spotted the particular post that targeted them out of all the posts Laxmi made?  The magic behind that is products like Collective Intellect.

Technical Story Telling

I have been part of recruiting engineers into the different engineering teams I have worked for or been associated with.  During interviews, we have to explain what a company and/or product does or offers, and being technical people we tend to use many concepts from the functional domain.  While this works in some cases, it does not work all the time.  Especially if the candidate is not from the same domain, this job sales pitch becomes more of a flop show than a hit show.  We are all humans first before we become engineers, so I have found that the story-telling technique works very well for technical stuff.  Crafting a story for our technical work not only helps in recruitment but also in explaining our work/company/product to family, relatives, and friends.  This is an introduction article for my blog posts on Technical Story Telling.  The actual stories will come soon.

Thanks,

Laxmi Narsimha Rao Oruganti

Integration – [Processor, Memory] Vs. [Visiting Places, People]

Today I want to try explaining the integration between the processor and memory, using visiting places and people as the reference point.

Have you ever observed the in and out queues at visiting places like a large zoo or a famous temple?  Can you reason out why they are the way they are?

Zoo – Multiple entry gates and multiple exit gates

Temple – One entry gate and one exit gate

Why doesn't a temple have multiple gates?  Why doesn't a zoo have only one gate?

Zoo – After the entry gate, the possible 'views' are many.  One person can go to the animals, another to the birds, yet another to the trees.  The more views possible, the better the people are absorbed into those views.  So, allowing more people in at a stretch does not hurt; it makes the system better.  Fewer entry gates only increase the queue lengths and result in inefficient zoo usage.

Temple – The only 'view' is the holy deity.  There is just one 'view'.  So, allowing in more people is only going to make the situation worse.  You know how good humans are at self-discipline :-) (of course, there are exceptions).

What does that observation tell us?  The in-flow and out-flow must be designed with the actual 'view' or 'consumption' system in mind.  A superior in-flow (many entry gates) designed without thinking of the main consumer (the temple) is going to create a mess.  An inferior in-flow (few entry gates) when the main system (the zoo) is ready for heavy consumption reduces the usage efficiency.

When it comes to computers, processor and memory are designed the same way. 

Processors are designed around a 'word' pattern rather than a 'byte' pattern.  For example, you hear of 16-bit processors, 32-bit processors, and 64-bit processors.  Those 16, 32, or 64 bits are the word size.  Processors process a word at a time.  The registers, arithmetic logic unit, accumulator, etc. are all in sync with the 'word' pattern.

Let us come to the memory and look into it a bit more.

Byte-addressable memory is a memory technology where every byte can be read/written individually without touching other bytes.  This technology is better for software, as multiple types can be supported with ease.  For example, an extra-small int (1 byte), small int (2 bytes), int (4 bytes), long int (8 bytes), and extra-long int (16 bytes) can all be supported with just 'length of the type in memory' as the design point.  There are no alignment issues, such as a small int having to be on a 2-byte address boundary, an int on a 4-byte address boundary, and so on.  Surely, from the software point of view, byte-addressable memory is the right technology.  But this memory is a bad choice for processor integration.

Word-addressable memory is a memory technology where one can read/write only a word at a time.  This is better for processor integration, as processors are designed around 'word' consumption.  But it surfaces memory alignment issues to software, which has to deal with them at those layers.  It also brings challenges like the endianness problem across different 'word' patterns (in processors).
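
For instance, here is a tiny, illustrative check (not tied to any particular processor) showing how the same 32-bit word is laid out differently in byte-addressable memory depending on endianness:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t word = 0x11223344;
    unsigned char *bytes = (unsigned char *)&word;   /* view the word byte by byte */

    /* Prints "44 33 22 11" on a little-endian machine,
       "11 22 33 44" on a big-endian machine */
    for (int i = 0; i < 4; i++)
        printf("%02x ", bytes[i]);
    printf("\n");
    return 0;
}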

From the processor-memory integration point of view, 'word addressing' wins.  From the software-memory integration point of view, 'byte addressing' wins.

Hardware is manufactured in factories (and is hard to change after fabrication).  Whereas software is more tunable/changeable/adaptable: change one line of code and recompile, and the change is ready in your hands (deployment is a separate issue, though).  So the choice stares us in the face: choose the memory that is right for processors, and let the problems be solved at the upper layers, such as software and compilers.

So compilers came up with techniques like padding.  Compilers also support packing, to help developers make their own choices and override the compiler's inherent padding behavior.
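
As a quick sketch of what padding and packing look like in practice (the exact sizes depend on the compiler and target; the numbers in the comments are typical for a 64-bit build, and #pragma pack is supported by GCC, Clang, and MSVC):

#include <stdio.h>
#include <stdint.h>

struct Padded {
    uint8_t  flag;    /* 1 byte; the compiler typically inserts 3 bytes of padding here */
    uint32_t value;   /* so that this 4-byte field lands on a 4-byte boundary */
};

#pragma pack(push, 1)  /* packing: ask the compiler not to pad */
struct Packed {
    uint8_t  flag;
    uint32_t value;    /* may now sit at an unaligned address */
};
#pragma pack(pop)

int main(void) {
    printf("sizeof(struct Padded) = %zu\n", sizeof(struct Padded));   /* typically 8 */
    printf("sizeof(struct Packed) = %zu\n", sizeof(struct Packed));   /* typically 5 */
    return 0;
}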

With all that understanding, let us take a simple primitive as an example and reason through all these design choices.

Memory Copy:  Copy byte values from one memory location to another memory location

Signature (simplified for this discussion): memcpy(source, sourceOffset, target, targetOffset, count)

It is very common for a program to require copying bytes from one location to another (the network stack is a famous example).  In simplistic code, the memory copy primitive would look like this (data types, bounds checking, etc. are excluded for brevity):

// Naive version: copy one byte per iteration
for (int offset = 0; offset < count; offset++)
    target[targetOffset + offset] = source[sourceOffset + offset];

To a software programmer who does not know the underlying design details, this looks like correct and performant code.  Well, software engineers are smart :-) and love to learn.  We know that SDRAM is the memory technology and the hardware is 'word' based.  That means even if I were to read a byte at address 'x', the underlying hardware is going to fetch a 'word' at a time into the processor.  The processor then extracts the required byte (typically using registers) from that word and passes the byte to the software program.

What does this mean for the above code?

Assume the source and target offsets are aligned on a word boundary.  Let us say the word is 64-bit (8 bytes).

When the for loop has offset = 0, the source memory bytes from sourceOffset + 0 to sourceOffset + 7 are read (that is, one word).  Because the software asks for only the first byte, that byte is extracted and the other bytes are thrown away.  Again, when the for loop has offset = 1, the same location is read again from RAM, but a different byte (the second byte) is extracted and given to the software.  And so on, until offset = 7.

So, for offset = 0 to offset = 7, the code is inherently reading the same word from RAM 8 times.  So why not fetch it only once and use it in a single shot?  Well, that is what the memcpy primitive code does (a learned programmer's code).  Here is a modified version:

// Copy as many whole 'words' as possible (64-bit word here, i.e. uint64_t from <stdint.h>)

int offset = 0;
for (; offset + 8 <= count; offset += 8)
    *(uint64_t *) (target + targetOffset + offset) = *(const uint64_t *) (source + sourceOffset + offset);

// Copy the remaining bytes that do not make up a complete 'word'

for (; offset < count; offset++)
    target[targetOffset + offset] = source[sourceOffset + offset];

 

Well, in reality the memcpy code is not as simple as the above, because the target and source offsets might not be word aligned.  If I am not wrong, memcpy could actually contain assembly code directly (and some implementations do have assembly code).  After all, it is all about mov (to move a word) and add (to increment the offset) instructions (I remember my 8086 assembly programming lab sessions!).

Padding and packing are also super important when one is worried about performance.  Padding helps keep the content/data/variables aligned.  Otherwise, efficient code like the above won't be useful at all and will result in performance issues.

That is all for now, thanks for reading.  If you liked it, let me know through your comments on the blog.  Encouragement is the secret of my energy :-).

Thanks,

Laxmi Narsimha Rao Oruganti (alias: OLNRao, alias: LaxmiNRO)

Of Home and Shelves

For a change, I want to talk about my home.  Well, not really; it is my attempt to explain the memory (volatile and persistent) hierarchy in a computer.

Have you ever observed the kitchen (or dressing room) in your home?  Let us take a look at it to understand and articulate how it is organized.

1) Small shelf (< 10 jars) very near the stove – We usually keep tea powder, sugar, salt, popu (పోపు), chilli powder, and oil on this shelf

2) Medium shelf (10 to 50 jars) within easy reach of our hands, but a little away from the stove – We usually keep all the other raw materials, such as red gram dal, black gram dal, idli rava, Bombay rava, wheat rava, vermicelli, etc., and utensils such as plates, glasses, etc., on this shelf

3) Big shelf (storage for cartons, gunny bags, drums, etc.) that is beyond the reach of our hands and requires a ladder – We usually keep storage items such as rice bags, oil cartons, and unused or rarely used items (dinner sets), etc., on this shelf

4) Apart from the shelves, we also have a work area near the stove where items are brought temporarily and are placed back in their respective places when we are done with them

I hope you now have a rough idea of how the kitchen is organized.  Here is the top-level reasoning behind the organization of items:

– The time taken to bring items near the stove increases in the order: small shelf, medium shelf, big shelf

– The more frequently you use an item, the nearer it is kept to the stove – salt, chilli powder, etc.

– The bigger the item, the more we split it into parts, keeping one part handy to use and the remaining parts farther away – rice bags are split into a small portion (10 kg) and the rest.  The small portion is kept within reach and the rest can be kept far away

Let us imagine a kitchen where things are not organized by the above rules/guidelines; what would be the side effect?  We would end up taking more time to prepare food than in the current model of organization.

Now let us analyze the stove.  Let us say I have a stove 'A' that can cook a curry in 30 minutes if all items are supplied without any delay (the ideal case).  Because there will be a delay in bringing the required ingredients and mixing them into the curry, we need to count that; here we might lower the stove flame to account for ingredient transport.  Let us say we spend 30 minutes getting the ingredients ready for use.  So, we can prepare one curry with stove 'A' in one hour.  I became rich and can afford a better stove 'B' that can prepare a curry in just 20 minutes (in the ideal case).  That is, the actual curry preparation time would be 50 minutes (with a transport time of 30 minutes).  An even richer me buys a better stove 'C' with a 10-minute preparation time (ideal); it would take 40 minutes in total.

Ouch, we are severely inefficient at bringing the ingredients near the stove.  So let us say you get an assistant to help and reduce the ingredient transport time from 30 minutes to 20 minutes.  With stove 'C' and the assistant's help, we will be able to finish the curry in 30 minutes.  Wow, yummy curry in 30 minutes!

Well, we become smart at doing things with practice.  We learn which ingredients are required at what time, so why wait until that time?  That is, we can predict the ingredient-to-time map and keep things ready at the right moment, making sure we don't unnecessarily occupy the space near the stove.  With this prediction, let us say we reduce ingredient transport to 10 minutes.  That means yummy curry preparation with stove 'C' and prediction takes just 20 minutes!

I hope that gyan about home, kitchen, shelves, stoves, etc. is enough.  Let us now compare this whole system with a computer, especially the memory hierarchy.

1) Small Shelves – On-Chip Processor Cache

2) Medium Shelves – RAM

3) Big Shelves – Hard Disk

4) Work area – Registers (AX, BX, CX, DX, AC, IP, etc.) + Processor Cache

5) Stove – CPU

6) Prediction – Instruction Pipeline, Branch Prediction

What shall we keep in the processor cache?  The most frequently accessed data.  With 'prediction' smartness, we also keep the 'next required' data.

What shall we keep in RAM?  What we use periodically: not super frequently, but moderately required, such as the currently running program's code pages, the usual program runtime data, etc.

What shall we keep on disk?  Big files, videos, binaries, etc.  If we need a file, we don't load the whole file into RAM but fetch parts of it (much like we take 6-10 kg of rice from the rice bag).

As we found in the discussion, the more time we spend on ingredient transport, the more we need to run the stove at a lower flame.  Likewise, the more time it takes to get data from RAM/disk, the less efficiently the CPU is used.  Performance experts refer to this as the 'memory wall'.

Just as we become smart at predicting which ingredients are required at what time, so as to avoid running the stove at a lower flame, modern processors have techniques like branch prediction to reduce the memory wall problem for CPU processing.
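
To see the 'ingredient transport' cost show up in code, here is a small, illustrative C experiment (a sketch, not a proper benchmark; exact timings vary by machine).  It sums the same large array twice, once sequentially (cache-friendly, every fetched cache line is fully used) and once with a large stride (cache-hostile, most of each fetched line is wasted):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16M ints, about 64 MB: far larger than any CPU cache */

int main(void) {
    int *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (int i = 0; i < N; i++) a[i] = 1;

    long sum = 0;
    clock_t t0 = clock();
    for (int i = 0; i < N; i++)            /* sequential walk */
        sum += a[i];
    clock_t t1 = clock();
    for (int s = 0; s < 4096; s++)         /* same total work, but strided walk */
        for (int i = s; i < N; i += 4096)
            sum += a[i];
    clock_t t2 = clock();

    printf("sum=%ld sequential=%.3fs strided=%.3fs\n", sum,
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(a);
    return 0;
}

Both loops touch every element exactly once; the only difference is how far the 'ingredients' have to travel for each use.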

If we have 200 kg of rice (4 rice bags of 50 kg each), we might keep 3 full rice bags on the big shelf and have another shelf, farther away than the medium shelf but not requiring a ladder, where we keep the 'currently' used rice bag.  Sometimes we have a kitchen without a big shelf at all, in which case we don't have much storage space and so we won't keep that many bags of rice.

Just as we can add more types of shelves, similar things apply to computers.  We now have solid state drives, which come between RAM and the hard disk.  We can also have a computer without a hard disk; it just means we don't have much storage space.

With all this kitchen discussion, I am feeling hungry.  I will be back with another article on technology in simple English.  Till then…

Laxmi Narsimha Rao Oruganti