ZG Session | Building a Pipeline for Near Real Time Reporting at Comcast

The three-person team at Comcast tasked with making business applications data ready for analysis was challenged with providing 1300 users with the data they needed in a timely manner, while the data was still relevant. The problem was two-fold: 1) the data warehouse was only refreshed once a day, meaning that the data in the batch overnight reporting was up to 24-hours old by the time it was delivered to the business; and 2) it was taking developers up to two weeks to make the changes needed to deliver new content.

During this session Prabha Kuppusamy, Senior BI Developer at Comcast, walks you through how Comcast reimagined their data pipeline to make the transition from batch overnight reporting to refreshing data every 15 minutes by eliminating time-consuming transformation and aggregation processes. She also discusses how they are now able to deliver all usable data for analysis down to the transaction-level details, while significantly reducing development time even for the most complex use cases.

Transcript:

Prabha: Thank you. Thank you Incorta for the wonderful opportunity. Imagine if you were able to refresh 150 ETL workflows every 15 minutes and be able to report in seconds on top of 400 million rows data set. Hi, I'm Prabha Kuppusamy, senior bi developer at Comcast. And today I'm going to be talking about the implementation of near real time reporting done at Comcast using Incorta. So as part of this presentation, I'm going to be covering our journey from OBI to Incorta, the implementation, Comcast schemata signed the challenges that we have faced and the solutions that we have implemented to overcome them, and the overall benefits of OBI over to Incorta.

Start with Comcast Corporation, as you may all know, is a global media and technology company. We are focused in streaming broadband, and aggregation with over 56 million customer relationships across the United States and Europe. Since 2014, our small team of nine members have been supporting around 2000 financial users with their reporting needs, which includes our team maintaining 3 OBI instances, 3 ETL/DAC instances, and 3 data warehouse instances. We source about 90% of our data from source from Oracle EBS, and the remaining 10 percentage from sources like Workday and Kronos via CSV files. In about 2018, we knew we needed to review the other Enterprise BI tools in the market for two main reasons. One, our OBI footprint, all the components within the OBIA were all nearing end of life cycle. And also to though we have had learned success with using OB features and functionality.

Like because of the main challenges like user interface and the reporting rendering time. Like this was this tool was seldom used more for an extract and pivot purpose. And also the other disadvantages that we have seen with OB is like our overnight processing, like our data availability. Overnight process used to take around eight to nine hours in order to bring the day minus one data into the warehouse. And two our real time reports used to take a little longer than its usual time around the business peak hours and three with the development, like we had to spend more time because of the involvement of multiple components like Informatica, DAG, data warehouse, OBI, RPD and OBI frontend. With Incorta, we saw a clean and fast user interface, which will not only allow us to move forward with our business by providing the users with the existing reports run much faster and better, but also allow them to slice and dice the data instead of just sifting through the details.

So when we have decided to move from OBI over to Incorta, the main challenge we have had in front of is was to get a 15 minutes data refresh for our general ledger module in order to support our accounting users with their monthly softclose accounting activity. Now, what is softclose? Softclose that happens between 22nd to end of month at Comcast freezes the accounting transaction that hits the profit and loss accounts, thereby allowing these accounting users to reconcile and report their data for their department for that particular period. Now, around these time, this particular window of 22nd to end of month, these setup accounting users will fire the OB real time queries as in when needed to make their financial decisions. Now this ability for them the near real time in answer the real time feature in OB for sure demanded or near real time reporting to be done at Incorta.

So with OBS, real time report run time threshold of 15 minutes our business have agreed to have the data refreshed every 15 minutes. Now that's it's a first requirement have the data refreshed every 10 minutes. And the second thing is a general ledger reporting is made of made of two datasets. One, all the journals from the general ledger tables and two to pull the account payables journals from the sub ledger accounting. But from the recruiting standpoint, our users requirement was to have these two merged into one operational view so they will have the flexibility to run one report and pull all the journals irrespective of the journal sources. And that's the second requirement, which would support around 20 analytical reports for a data volume of 400 million rows, which is for like last few years. And the ability to report at a highly detailed level and the next one ability to report like slice and dice the data by month, year and quarter by various GL attributes. Were not kind of feasible to OBIEE due to the data volume. Fifth one, fifth requirement.

At Comcast, we protect or secure our financial data at the business unit level, which requires for this particular requirement to also refresh the security table every 15 minutes along with the actual data refresh. The last one is to reduce the number of reports when we moved from OBI to Incorta in order to lower the maintenance efforts. And so with these six requirements, like these were the main six requirements with which the major focus being 15 minutes refresh. So, when we have signed off these requirements from the user, like we have laid out a high level design that would look something like this. So for this particular implementation, our EBS is going to be our only source.

And we have decided to capture the data set for the ledger module in two different schemas. Because like, we have two different data frequencies for these two data sets. Like all the journals that comes in via data set one, which is a general ledger tables is going to be refreshed every 15 minutes once and the second data set, which is getting collected from the sub ledger accounting tables, which is nothing but our account payable journals will get refreshed only on a conditional basis. So we have decided to set up a Cron tab, job polling of every five minutes for schema one, which would pull every five minutes and trigger the schema every 15 minutes, say if the schema has got done at like 16 minutes, it will wait for next four minutes. And then like the poll will again trigger the schema.

And for data set two we have set up a conditional refresh polling, meaning this polling will every five minutes once hit the source database, and will only trigger the schema only when it finds an incremental refresh data available with the Oracle source and if not, that's schedule will be skipped and it will keep on looking for an incremental data. So this is kind of ad hoc, as in when the data becomes available at the ESB, please go ahead and trigger else like keep pulling. And then we have Incorta in the middle where we have that one operational view which would match both the datasets and are users at the other end. So with this high level design, we have in the next few slides, I'm going to be discussing about the schema design that we have done the physical schema design that we have done and the challenges that we have faced and then the solutions we have implemented.

So this is the final version of physical schema design for our general ledger model that we have at Comcast right now. And this is approximately the simplified form of around 150 ETL Informatica workflows, which now comprises of only like 10 base tables in its original form, these are like the blue ones are all the base tables in its original form. These are like Oracle EBS tables, we have not done any kind of transformation here, and 10 to 12 alias tables. And on the right hand side, the yellowish ones are florent tables hierarchy tables, which are designed using Incorta analyze the tables. And then the green ones, the bottom green ones are the MVs. The materialized views, these is the one we place within the entire schema design where we have done like a minor or I would say like 10% of transformation to have our data available for the reporting end.

So this XLA brush is one that one operational view that we discussed in the previous slide, which is kind of merging all the journals from these base tables, which is under EBS, GL General Ledger tables. And also it brings in the XLA the account payables data from the XLA module. And this merges the information and keep it in one place for the reporting to support like 20 analytical reports for this module. And the second table here is the materialized view, which is the security materialized view which supports our security like to secure our data at the business unit level.

I will discuss about this dependent schema tables in the next slide. So as I have mentioned, this is like the final version of the schema that we have in place right now at Comcast. But before we reached the spot like we have had to we have had our schema in different layouts, different stages, we have run into like various issues and then we had to take different steps in order to overcome those challenges and then like meet our 15 minutes goal because like our ultimate goal is to have the intermodal refresh under 15 minutes. So the first one is that we have faced moving the infrequently updated data set to a separate schema. Yes. So here, account payables like since this operational view is merging data from data set two to data set one, we know that at Comcast and data set two will be ad hoc and it will not be more than six times a day, which means which is a clear indication that it is not that frequently updated as compared to our main requirement of every 15 minutes.

So, we have made sure that none of the AP tables, none of the infrequently updated tables stays within 15 minutes reporting schema refresh. And second thing is moving huge table joins into the materialized view itself, we have seen spot to perform better with heavy tables and heavy joints than Incorta. So we have made sure that we are moving all these heavy joints and tables into the materialized view for spark to process them and dump them into the tables for Incorta to just access it. And using input analyzer tables, wherever applicable. This is like we have at least as a developer, I have realized that this is the most effective way to effectively utilize the SPARC resources. And like when with the latest version, Incorta 515. The Incorta analyzer table data is also getting stored in packet tables, which is a nice feature. So then like you can start referencing them in all your reporting from the front end.

And the second thing is schema refresh interruption. So our main goal for this near real time reporting is to have the entire general ledger module refreshed within that 15 minutes window. So like before reaching this final stage, we have had our schema in different parts, as I've already mentioned. So the one of this part was we have hired all these flattened tables, which supports our hierarchy information under a different schema called think common or like I would say for like the general purpose, I'd call it as a dependence schema or parent schema.

So we were not having these tables within the reporting schema, we're having all these tables within another schema parent schema. And as part of every 15 minutes afresh, the goal was to refresh the reporting schema, which will take anywhere from 13 to 14 minutes, and then go on to refresh the parent schema which should take from four minutes to five minutes, which is adding up to like close to 17 minutes, which is not helping us. And the other issue that we are faced here is since we have had this is like this is Comcast Standard Hierarchy information, which means like it should be used in other modules like account payables purchase orders, racks, capex, all the other modules, which means it also had links or joints to all the other modules in addition to GL reporting schema.

So these two, like one we were not able to meet 15 minutes mark and two, because of its links to all the other schemas every 15 minutes refresh was kind of causing user interruption from the front end. So we decided to duplicate those flattened tables into the reporting schema itself and have the entire schema that the current the links to the other schemas by doing this, and also we were able to achieve that 15 minutes mark.

And third thing is transformation phase tuning. So as I have mentioned this one operational view up here, like which merges the data from GL as well as from the XLA side, meaning like GL data set, one will pull all the journals and also the same journal will be available from data set two. So the moment that a journal becomes the same journal becomes available from dataset two, we have a strategy within this key within this materialized view, which would mark the same journal from dataset one as soft deleted, otherwise, it would become duplicates. So we would hold the journal from dataset one, until the same becomes available from data set two. The moment it becomes available, we'll make it as soft deleted. So over a period of time, these Soft deleted records kept increasing and which was which was causing a high, which was causing a primary key indexing issue, which was having a direct impact to our runtime.

So we decided to go for a full load only on that particular table in order to clean up that soft deleted records and make our MV run better. And the second thing over here is, as a developer, I have felt that using max of like using greater than question mark performs much better than like using max of refresh time, using max of refresh time is a developer's way to like enforce Incorta to just scan through the entire row set and then find the last refresh time where it's like using greater than question mark would just fetch it from the file and then append it for the incremental load. Fourth one - extraction tuning.

So we all know that Incorta, only issues like Select Star and specific tables to the Oracle database or any source database for that matter, but around like peak business hours, when there is high contention at the database end, we have seen like our extraction time has, like gone up by two times or by three times. So we know that we needed some kind of strategy in order to optimize the queries that are running at the Oracle end, so we came up with an idea of adding parallel hands, adding appropriate index based column filters and adding appropriate source data filters. In order to make the extraction queries run better at the source database end. And this is the materialized view tuning. And with this materialized View like, we have always had to play with the spark configurations. But with this materialized view from the beginning like it was not like one go where like I would say, hey, you know what, like, the executor course is always going to stick with 15 and we driver cores are always going to stick with 10 something like that. So as in when we had to, like seen better see better performance with this MV, we have always had to play with the spark configuration for that particular MV.

So the best technique that has given us good performance is the shuffle and repartition technique, which is kind of instructing the spark to blow up to the explore the number of partitions, processed the data, transform the data and then repartition, rematch and hand it over to Incorta which had shown us very good positive impact from the runtime standpoint. And also we have tried to utilize like the right implemented data filter. And also we have made sure to apply right data filters in order for spark to spend less time in scanning through the huge tables. And the last one in the series is post load tuning. post load tuning like the first issue that we have faced with post load tuning is when we realized that the cross schema formulas that we have returned within the reporting schema tends to take a lot of time.

And when we found that we tried to simplify them by moving the cross schema formulas into the parent schema itself, so it won't get refreshed every 15 minutes once. But it would get refreshed with the time the parents can get refreshed. And then the GL reporting schema will only be referencing that. And also we have made sure to reduce the number of alias tables which were referenced in the formula calculation and also removing all the unnecessary joins. So these were like the six major pain points that we have had with implementation and then the solutions that we have implemented to have our GL schema refresh every 15 minutes. Now at Comcast after doing all these changes, we are now able to meet that 15 minutes. And the next few slides are going to be like I'll be covering the overall benefits that we have seen from OBI over to Incorta.

So from speed of development standpoint, OBI in order to in order for us to implement a complex task being a developer like I used to take like 8.5 days, and again the reason is because of the involvement of multiple components I have to spend like five days in Informatica, one day entire ETL, shell scripting and point five days in the data warehouse changes on two days in OBIEE and the front end, whereas I Incorta we are down to one day, like I have to spend some time in physical schema, business schema, and then like the reporting, that's it. So which has cut our down time from 8.5 to one day, which is like, we are giving back 7.5 days, I'm saving my 7.5 days and I can utilize it to develop other things.

And from data availability standpoint, we have gone from mostly updating once a day or real time in OBIEE instance over to like every 15 minutes in case of on your real time implementation to max of four times a day or like we have scheduled it based on the data availability of that particular module from the source end. And from data load wins standpoint, the huge load when that we have noticed is with fixed asset module back in OBIEE like it used to the full load once a month is used to take like 17 hours, which means we used to spend like eight hours in order to bring the data from Oracle EBS into the staging layer. And we have to let our overnight process run because like we cannot interrupt the overnight process. And then the next following day we would run which would take the data that was stored in the staging layer into the reporting so around like all around like it would take around 17 hours, which would disrupt our business like our fixed asset users because like they will have to wait at least 24 hours for them to see the module closed are accounted data for fixed asset modules.

In OBI reporting, whereas in Incorta are followed now it's taking an hour and 20 minutes and also we have the flexibility of running the fixed asset modules multiple times during the day. A project management is down from eight hours to 30 minutes and also the same flexibility of once a day to four times a day in Incorta, employee expenses is down from six hours to 30 minute payroll and other bi weekly business critical load is down from three hours to 30 minutes. The last aspect of benefits is the report rendering time this is my personal most favorite aspect from Incorta and for the implementation that we have done. We have just discussed GL. So GL line details payables data is getting refreshed every 15 minutes and the report in OB used to take an average of 22 minutes for 2k rows whereas in Incorta it is down to seconds in order to render one month worth of data which is around 80 million plus also, it helps us to do like the aggregate preview, like we can report at like your level month level quarter level, it's not that line level.

And GL line details average of one minute for 7k rows is now down to seconds for one month, which is around like 80 million plus aggregates. GL journal source by category, this report like is to kind of time out because like this is an aggregated data at journal source by category. So we're just in Incorta, which is now approximately takes a minute to gather the same 80 million rows per segregates. Revenue YTD, so this reporting OBIEE like used to work for a couple of months in the year start. And then as in when the data gets accumulated more, it would stop working. There is Incorta, it is now getting done under less than two minutes in order to process like 144 million rows. And CAPEX transaction details, this is one of our most used report from OBIEE world and also an incredible it used to take an average of 70 minutes in order to fetch 10k rows, whereas in Incorta it is getting done under a minute.

The other major benefit that we have seen from reporting and is creating reconciliation dashboard for our users. So we have created revenue to GL, inventory to GL reconciliation dashboards. What it does is, like revenue is an interface for the Oracle EBS. So users would always want to see hey, what I have interfaced versus what has been posted in the general ledger module. So these kinds of reporting were not possible in OBI due to the data volume, because like in revenue is so huge as far as the GL. But in Incorta, we have created one dashboard and we have created insights wherein like on the on one side will have all the revenue interface data for that particular period. And then the other side will have whatever we interface via revenue and got posted into GL on the other side. So users will be able to compare side by side without having to download and do all the Vlookup functions that they were doing previously.

And the last thing here is like we were the short survey that were that was taken at Comcast, majority of the users feel that they'll be able to save two to five hours monthly and the next largest group feels that they'll be able to save five to eight hours monthly. So yes, now we with all these changes that we have done like we are now able to achieve our 15 minutes refresh with general ledger module and also all the other modules are getting refreshed as frequently as needed. Thank you.

Building a Pipeline for Near Real Time Reporting at Comcast

Speaker:

Prabha Kuppusamy