Every company is a data company, whether it’s Google, Meta, or Amazon. Every MNC and startup is ready to pay millions of dollars just to acquire more data, protect the integrity of their data, and derive business strategies from it. Google Cloud has the most advanced data engineering and analysis tools to date. Clients often migrate to Google Cloud to leverage industry leading tools like BigQuery.
Before making business decisions from data, you should be very comfortable with the storage options that Google Cloud offers. To see a detailed overview of storage options, head over to the Google Cloud Documentation.
If you don’t pass the Google Cloud Professional Data Engineer exam the first time, we’ll pay your retest fee.
See details
In this type of storage, every item is considered an object. You can store anything in it, whether it's an image, a video, or a csv file. The service that supports object storage in Google Cloud is Google Cloud Storage. The Buckets resource offered as a part of Google Cloud Storage is serverless by build, utilizes a pay-as-you-go model, and offers advanced features like Lifecycle and Retention Policy.
This option is also known as Block Storage. Persistent disks (PDs) are located independently from your virtual machine instances. Therefore, you can detach or move persistent disks to keep your data even after you delete your instances. The major use of persistent disks is to attach them to instances. PDs are advantageous because they are serverless, easy to attach, and have automatic security. As of now, Google Cloud supports four types of persistent disks: Balanced PD, Extreme PD, SSD PD, and Standard PD.
Google Cloud offers a range of options for obtaining and storing various types of data. You can use Cloud Firestore and Cloud BigTable for storing No-SQL Data, and Cloud SQL and Cloud Spanner for OLTP workloads. Furthermore, all these services come with advantages like serverless, autoscaling, cross-region replication, a pay-as-you-go model, and much more. The last option is BigQuery, which isn’t a storage option at all, but is instead used for data analysis and warehousing.
Storage features to keep in mind:
Next, let’s learn about the major types of data. With this information, you will be able to learn how to segregate an entire dataset into a predefined data type, use the right storage service, and later on derive key decisions from it.
This is a predefined data type, which means that before ingesting data you need to define the key parameters of the data, and it is stored in rows and columns. For example, if you wanted to create an employee database for your organization, you would need to define the employee name, employee ID, address, and other details. SQL is often used to maintain such databases. The advantages of using structured data are faster access, ease in scaling, and data warehousing. In general, online transactions and SQL Databases are categorized as this data type.
This is the most common type of data; around 70% to 80% of generated data falls under this category. Unstructured or unorganized data doesn’t follow a particular schema or order. Unlike structured data, you can’t use rows and columns to store information. We generally store this type of data in object storage (i.e., each entity is separately stored and accessed through keys or identifiers.) Some common examples of this type of data are videos and pictures. The major advantage of unstructured data is that, as they are schemaless, the data is portable and supports even structured data.
This category is a mixture of the structured and unstructured data format. The data stored in this format is neither row nor column based, nor object based. You should go with this data format when you figure out that your data has some common properties but that those properties are not fixed (e.g., if you are looking for a flexible schema.) Real life examples are XML files, zip files, and emails.
Once you have categorized your data, you'll need to find a suitable storage type.
CloudSQL is a cloud-based solution for MySQL, PostgreSQL, and SQL Server. It is one of the OLTP (Online Transaction Processing) databases provided by Google Cloud, and it supports ACID properties. With CloudSQL, you can create an instance of your desired machine type, store it, and create a public or private instance. You can even create a read replica with high availability and durability.
Note: CloudSQL is not a global service. It also supports advanced features like data backups, accidental deletion, and maintenance windows.
This is also an ACID supported database. The major advantage of choosing Cloud Spanner is that it is a fully managed, globally distributed database by design. Furthermore, Google Cloud provides an SLA of 99.999%, and it can scale up to a whopping 2TB of data per node. Just choose the location of your database (region or multi-region), adjust the number of processing units or nodes (1 node = 1000 processing units), define the schema, and enter the data. It’s that easy to use Cloud Spanner!
Cloud Firestore is a NoSQL cloud-based database solution, also known as a document database. Choose this type of storage when the schema isn’t fixed or the data has different attributes. Currently it supports two modes, namely native mode and datastore mode.
Note: Once you pick a mode, you won't be able to change it. For more differences between the two modes, you can refer to the table below.
BigTable is a wide-column, petabyte-scale managed database. Because it is built using a wide-column database, it comes with advantages like faster speed of querying and flexibility in schema appended by the pros of Cloud. Just like primary keys in relational databases, we use row keys in BigTable. Organizations generally use BigTable when they have to deal with large data streams like IOT sensors.
Note: Be very careful when designing the row keys for BigTable.
Now that you've reviewed data segregation and data storage in Google Cloud, let’s explore data analysis.
BigQuery is one of the industry leading OLAP (Online Analytics Processing) tools currently on the market. It is a fully managed, petabyte-scale, and cost-effective analytics data warehouse that allows you to run analytics over vast amounts of data in near real time. You don’t have to build a team to manage the storage, compute, and DR patterns. Google Cloud will take care of all of that for you.
BigQuery is often confused with an OLTP database solution. BigQuery is not meant for the same purpose as OLTP. The sole purpose of including BigQuery in your architecture should be for data analysis and data warehousing.
BigQuery ML is a subset of BigQuery where you can create machine learning models by using SQL queries. BigQuery Omni is a service that is widely used by hybrid-believers. It is used to run BigQuery on the data stored in third-party services like Amazon S3.
Schema is the blueprint of your entire database. It defines the number of attributes and the relationship between the attributes. In BigQuery, you can either define your own schema or you can use the auto-detect feature providing you hassle-free analysis of your data.
Datasets are high level entities that store and manage your data for analysis. You need to have at least 1 dataset in a project in order to start working with BigQuery. Furthermore, there are two types of datasets: Public and Private. Datasets store the data in tables that are used for analysis. Datasets can be regional or multi-regional.
A BigQuery table contains individual records organized in rows. Each record is composed of columns, also called fields. Every table is defined by a schema that describes the column names, data types, and other information. You can specify the schema of a table when it is created, or you can create a table without a schema.
As of now, BigQuery supports two types of query language: Standard SQL (ANSI compliant) and the Legacy SQL (may deprecate soon).
Federated queries are very beneficial when you have a Cloud SQL instance running and you want to perform some queries in real-time. To leverage, you create a query with the function EXTERNAL_QUERY(). The condition for using federated queries is that it requires a publicly accessible Cloud SQL instance. Learn more about federated queries here.
An authorized view lets you share query results with particular users and groups without giving them access to the underlying tables.
Follow these step-by-step instructions to start using Google's BigQuery for data analysis.
1. Log into your Google Cloud Account.
2. Search for BigQuery in the search bar and select it.
3. For this tutorial we will be using a public dataset. Click on add data to add a full, publicly available dataset.
4. From the variety of options, select public datasets.
5. Select the Covid-19 Public Datasets. These contain public data about Covid-19 cases, countries, and other key metrics.
6. In order to use this dataset, you need to make a copy of it.
7. Choose your project ID, a unique dataset ID, and the data location. You can even enable the table expiration period if you’d like. Note: You may need to enable some APIs during this process.
8. We have successfully made a copy of the database. In order to query the dataset, select the table with the name covid19_open_data and click on Query.
9. Let’s do some analysis on the data. Enter the query as written below.
SELECT place_id ,wikidata_id , country_code FROM `<PROJECT-ID>.<DATASET-ID>.covid19_open_data` WHERE location_key='US' LIMIT 100
10. One of the most impressive features of BigQuery is that, after you enter a valid query, it checks and returns the number of bytes that will be scanned in order to run the desired output. Note: In BigQuery, you are charged according to the number of bytes scanned.
11. You can see the output of the query in the results tab.
12. You can look at the stages, timing, and row reads and writes of the job in the execution details tab.
13. You can run the sample queries like below and play around with the dataset.
SELECT place_id, count(*) as ci FROM `<PROJECT-ID>.<DATASET-ID>.covid19_open_data` WHERE location_key='US' GROUP BY place_id LIMIT 100
SELECT population ,debt_relief, population_age_20_29 FROM `cloudinstitute-1.cloudinstitute_dataset.covid19_open_data` WHERE debt_relief=1 ORDER BY population DESC LIMIT 100
14. After running any query, you can even save the query and share it with your teammates.
15. Google Cloud offers a variety of options when it comes to your queries, like saving your output in multiple formats and even visualizing the data using tools like Data Studio.
16. We can even work with BigQuery using the bq CLI tool. For the rest of the tutorial, we will be using the cloud shell. Click on the terminal icon to enable it.
17. Use the bq ls command to list all the datasets in the current project.
18. Let’s create another dataset using the bq mk cloudinstitute_cli_dataset command.
19. Now perform some queries in the dataset we created using the console. Note: You can use the –dry_run parameter, which will tell you the number of bytes the process will need to scan in order to deliver the desired output.
bq query --use_legacy_sql=false --dry_run 'SELECT place_id , country_code FROM `<PROJECT-ID>.<DATASET-ID>.covid19_open_data` LIMIT 100'
20. It’s time to delete the created datasets using the bq rm <DATASET-ID> command to delete any dataset. If you want to delete the tables inside the dataset as well, use the -r flag.
21. Use the bq ls command again to make sure that there is no dataset present. Note: Delete the created resources as they may incur costs.
Now let's connect Google Sheets with BigQuery in the following steps.
1. Create a new spreadsheet or open any existing sheet from Google Sheets.
2. Under the data tab, click on data connectors and select BigQuery as the connector.
3. As soon as you select BigQuery as the connector, a pop-up window will open showing all the Google Cloud projects connected to your Gmail ID. Note: Make sure to create a dataset and a table in BigQuery, either through the console or the CLI before moving to the next step.
4. Click on connect and run your queries.
You can head over to this Developer Cheat Sheet developed by Google Cloud for a quick recap about the services.
Ready to start earning Google Cloud certifications? We’ll give you a coach and help you pass the exam the first time—guaranteed.