Intro to Data

Lakshmi Shiva Ganesh Sontenam
6 min read · Apr 23, 2022

Data never sleeps!

Graphic: Business Wire

The rate at which we create information has been growing for years at a more-or-less predictable pace, so we can get a decent idea of how much data will exist in the world a few years from now. Google, Microsoft, Amazon, and Facebook alone store an unbelievable 1,200 petabytes (1,200 × 10^15 bytes) of information. As striking as that number is, the diagram above shows how much data the rest of the internet generates every single minute. With the aid of 5G networks, IoT devices, and Web 3.0 (backed by blockchain technology), worldwide internet penetration will continue to surge, and so will the data we produce.

Industry estimates agree that there should be around 175 zettabytes (175 × 10^21 bytes) of data by 2025. It's a number that's hard to envision, yet with the advanced processing technology available today, handling these massive amounts of data is no longer impossible.

Photo: Sontenam, Lakshmi Shiva Ganesh

Traditional data tends to be measured in gigabytes (10^9 bytes) and terabytes (10^12 bytes). As a result, it is typically managed using a centralized architecture, which can be more cost-effective and secure for smaller, structured data sets.

  • Traditional data sets that are structured are generally stored on OLTP systems. Online transaction processing (OLTP) captures, stores, and processes transaction data in real time. OLTP is designed for RDBMS systems and, in CAP terms, favors Consistency and Availability.
  • An OLTP system captures and maintains transaction data in a database. Each transaction involves individual database records made up of multiple fields or columns, and the focus is on fast processing. These databases are read, written, and updated frequently; if a transaction fails, built-in system logic ensures data integrity (see the sketch after this list). Examples include banking and credit card activity or retail checkout scanning.
  • Products: MySQL, Oracle, SQL Server, IBM DB2, SQLite, MariaDB, Amazon RDS, Azure SQL, etc.
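
To make the OLTP pattern concrete, here is a minimal sketch of a banking-style transfer in SQL. The accounts table and its columns are hypothetical, and transaction syntax varies slightly across engines:

```sql
-- Hypothetical funds transfer: both updates commit together or not at all.
BEGIN TRANSACTION;

UPDATE accounts SET balance = balance - 100.00 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100.00 WHERE account_id = 2;

-- If either statement fails, the engine rolls back, preserving data integrity.
COMMIT;
```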

Big data is distinguished not just by its sheer volume but also by its velocity, variety, value, and veracity (together, the 5 V's of big data). It is usually measured in petabytes, exabytes (10^18 bytes), or zettabytes (10^21 bytes). As a result, it is typically managed using a decentralized, distributed architecture, in which accumulating and maintaining data can become more and more expensive.

  • Big data sets that are structured are generally stored on OLAP systems. Online analytical processing (OLAP) uses complex queries to analyze aggregated historical data, typically drawn from OLTP systems. OLAP is designed for big data warehouse/NoSQL systems and, in CAP terms, favors Availability and Partition tolerance.
  • OLAP runs complex queries against enormous amounts of historical data aggregated from OLTP databases and other sources for data mining, analytics, and business intelligence projects. The focus is on response time to these complex queries, each of which aggregates one or more columns of data across many rows. A query failure does not interrupt or delay customer transaction processing, but it can delay or reduce the accuracy of business intelligence insights (a sample query follows this list). Examples include year-over-year financial performance or marketing lead-generation trends.
  • Products: Teradata, Snowflake, Google BigQuery, Amazon Redshift, Azure Synapse Analytics, etc.
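
For contrast, a typical OLAP query aggregates many historical rows per result row. A minimal sketch, assuming a hypothetical sales_history table, along the lines of the year-over-year example above:

```sql
-- Year-over-year revenue by region: each output row summarizes many input rows.
SELECT EXTRACT(YEAR FROM order_ts) AS order_year,
       region,
       SUM(amount)                 AS total_revenue,
       COUNT(*)                    AS order_count
FROM   sales_history
GROUP  BY EXTRACT(YEAR FROM order_ts), region
ORDER  BY order_year, region;
```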

Typical Data Processing Cycle:

Extract -> Transform -> Load

The increasingly large size of big data sets (both structured and unstructured) is one of the main drivers behind the demand for Hadoop-style commodity cluster hardware, high-capacity cloud-based storage solutions (data lakes), and other modern data stacks for building big data architectures that support ETL: extracting, transforming, and loading the data.
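
A single ETL step can be expressed directly in SQL. This is only an illustrative sketch; the staging_orders and sales_daily tables and their columns are made up for the example:

```sql
-- Extract: an ingestion job has already landed raw rows in staging_orders.
-- Transform + Load: clean, aggregate, and load into a reporting table.
INSERT INTO sales_daily (order_date, region, total_amount)
SELECT CAST(order_ts AS DATE) AS order_date,
       UPPER(TRIM(region))    AS region,       -- normalize messy values
       SUM(amount)            AS total_amount
FROM   staging_orders
WHERE  amount IS NOT NULL                      -- drop incomplete records
GROUP  BY CAST(order_ts AS DATE), UPPER(TRIM(region));
```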

Check my post below to gain insight into big data in motion, its architecture, and engineering.

Although this post does not cover the details of how big data is ingested and processed in an enterprise data platform, it does give you some idea of how the data is organized and analyzed afterward.

Organization

Traditional data, thanks to its manageable size, can easily be organized as structured data in the form of files and tables. Fields in traditional data sets are readily relational, so it is possible to work out their relationships and manipulate the data accordingly. Traditional SQL databases, such as Oracle DB and MySQL, use a fixed schema that is static and preconfigured.

Big data uses a dynamic schema. In storage, big data is mostly raw and unstructured; the dynamic schema is applied to the raw data only when it is accessed. Modern non-relational/NoSQL databases like Cassandra and MongoDB are ideal for semi-structured data, while data lake storage like HDFS, Azure Blob Storage/ADLS, or AWS S3 is ideal for unstructured data, given that they store data as files.
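
The contrast is easiest to see in DDL. A minimal sketch with hypothetical table and column names; the schema-on-read line assumes a PostgreSQL-style JSON operator:

```sql
-- Fixed schema (schema-on-write): columns and types are declared up front,
-- and every row must conform before it can be stored.
CREATE TABLE customers (
    customer_id  INTEGER PRIMARY KEY,
    full_name    VARCHAR(100) NOT NULL,
    signed_up_on DATE
);

-- Dynamic schema (schema-on-read): semi-structured data carries its own
-- structure, which is applied only at query time, e.g. in PostgreSQL:
-- SELECT doc ->> 'name' FROM events WHERE doc ->> 'type' = 'signup';
```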

Analysis

Traditional data analysis occurs incrementally: An event occurs, data is generated, and the analysis of this data takes place after the event. Traditional data analysis can help businesses understand the impacts of given strategies or changes on a limited range of metrics over a specific period. For example, it can identify how much sales have increased during that specific period.

Big data analysis can occur in real time. Because big data is generated on a second-by-second basis, analysis can occur as the data is being collected. Big data analysis offers businesses a more dynamic and holistic understanding of their needs and strategies. For example, it can identify the specific areas that have been impacted, such as sales, customer service, public relations, and more.

SQL (Structured Query Language) is the most well-known standard for both managing and analyzing the data held in a relational database system (whether an OLTP database or an OLAP warehouse). It is particularly useful for handling structured data, i.e., data incorporating relations among entities and variables. The scope of SQL includes data query and analysis, data manipulation (insert, update, and delete), data definition (schema creation and modification), and data access control.
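
One illustrative statement for each area of that scope; the table, column, and user names here are hypothetical:

```sql
-- Data definition (DDL): create or modify schema objects.
CREATE TABLE accounts (id INT PRIMARY KEY, balance DECIMAL(12, 2));

-- Data manipulation (DML): insert, update, and delete rows.
INSERT INTO accounts (id, balance) VALUES (1, 100.00);
UPDATE accounts SET balance = balance - 25.00 WHERE id = 1;

-- Data query and analysis: read and summarize data.
SELECT id, balance FROM accounts WHERE balance > 50.00;

-- Data access control (DCL): grant or revoke privileges.
GRANT SELECT ON accounts TO reporting_user;
```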

SQL Basics: Table Joins (cheatsheet by https://learnsql.com/)
SQL Basics: Aggregate Functions vs. Window Functions (cheatsheet by https://learnsql.com/)
SQL Advanced: Sequence of Operations & Built-in Functions (cheatsheet by https://learnsql.com/)
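
As a quick taste of the aggregate-versus-window distinction covered in the cheatsheets above, assuming a hypothetical orders table:

```sql
-- Aggregate function: collapses many rows into one row per group.
SELECT region, SUM(amount) AS region_total
FROM   orders
GROUP  BY region;

-- Window function: keeps every row and attaches the group total to each one.
SELECT order_id, region, amount,
       SUM(amount) OVER (PARTITION BY region) AS region_total
FROM   orders;
```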

Stay tuned to my future posts regarding the roadmap, roles & responsibilities, skillsets, salaries, tech stack, and tutorials in the above-mentioned fields. Follow me:

https://shivaga9esh.medium.com/

https://www.linkedin.com/in/shivaga9esh/

Let’s connect on discord group learning, daily questions & clarifications: https://discord.gg/wePCJpbrVy

Working hard to keep it simple and keeping it real :)
