How to change the timezone for Spark jobs

Purpose

This is a simple but useful technique to keep data that is migrated into Hadoop through Spark jobs in the Vietnamese timezone (UTC+7).

Issue

In this blog, I used PySpark (Python) code to perform an ELT process into Hadoop storage. The flow ran into a timezone issue on the hour column, which affected the business logic that stores partitioned data on Hadoop. The partitions are laid out in the following structure (a minimal sketch of the write follows below):

folder structure:
  year
    month
      day
        hour
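
For context, here is a minimal PySpark sketch of how such a partitioned write might look. The epoch column event_epoch, the row values, and the HDFS path hdfs:///warehouse/events are placeholders for this post, not the exact code of my pipeline:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elt-partition-demo").getOrCreate()

# Hypothetical source row carrying the event time as epoch seconds (a UTC instant):
# 1735716136 = 2025-01-01 07:22:16 UTC = 2025-01-01 14:22:16 in Asia/Ho_Chi_Minh.
df = (
    spark.createDataFrame([(1, 1, 1735716136)], ["id", "user_id", "event_epoch"])
         .withColumn("event_ts", F.timestamp_seconds("event_epoch"))
)

# year()/month()/dayofmonth()/hour() are evaluated in the Spark session timezone:
# while the VM (and therefore the JVM) is still on UTC, hour() returns 7;
# with Asia/Ho_Chi_Minh it returns 14.
partitioned = (
    df.withColumn("year", F.year("event_ts"))
      .withColumn("month", F.month("event_ts"))
      .withColumn("day", F.dayofmonth("event_ts"))
      .withColumn("hour", F.hour("event_ts"))
)

# Write to Hadoop as year/month/day/hour folders (the path is a placeholder).
(
    partitioned.write
        .mode("append")
        .partitionBy("year", "month", "day", "hour")
        .parquet("hdfs:///warehouse/events")
)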

The flow should have landed in hour=14, but I got hour=7. 14:22 in Asia/Ho_Chi_Minh is 07:22 in UTC, and because the Spark session timezone on the VM still defaulted to UTC, the hour column was computed as 7:

+-------+-------+----+-----+----+----+
|     id|user_id|year|month| day|hour|
+-------+-------+----+-----+----+----+
|      1|      1|2025|    1|   1|   7|
+-------+-------+----+-----+----+----+

Solution

In my setup, I deployed Spark and Hadoop on a Virtual Machine (VM) as system services. Therefore, I needed to update the timezone on the VM with the following commands:

[root@datawarehouse ~]# timedatectl set-timezone Asia/Ho_Chi_Minh
[root@datawarehouse ~]# timedatectl
      Local time: Wed 2025-01-01 14:22:16 +07
  Universal time: Wed 2025-01-01 07:22:16 UTC
        RTC time: Wed 2025-01-01 07:22:16
       Time zone: Asia/Ho_Chi_Minh (+07, +0700)
     NTP enabled: no
NTP synchronized: yes
 RTC in local TZ: no
      DST active: n/a

=> Time zone: Asia/Ho_Chi_Minh (+07, +0700)

Then stop and start the Spark services on the VM, so that the JVMs pick up the new default timezone at startup:

$ cd $HOME/Spark/spark-3.5.1-bin-hadoop3/sbin
$ ./stop-all.sh
$ ./stop-connect-server.sh

$ ./start-all.sh
$ ./start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.1
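
After the restart, one way to confirm that Spark picked up the new default is to check the session timezone from PySpark. This is a small sketch assuming you can open a session against the restarted cluster; the app name and the Connect URL in the comment are placeholders:

from pyspark.sql import SparkSession

# Plain session for the check; if you go through the Connect server instead,
# something like SparkSession.builder.remote("sc://localhost:15002") can be used.
spark = SparkSession.builder.appName("tz-check").getOrCreate()

# spark.sql.session.timeZone defaults to the JVM timezone, which follows the
# VM setting picked up at startup, so it should now report Asia/Ho_Chi_Minh.
print(spark.conf.get("spark.sql.session.timeZone"))

# Sanity check: the local hour should now be 14, not 7.
spark.sql("SELECT current_timestamp() AS now, hour(current_timestamp()) AS hr").show()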

Outcome

Performing the ELT process again with the Python code gives the following result:

+-------+-------+----+-----+----+----+
|     id|user_id|year|month| day|hour|
+-------+-------+----+-----+----+----+
|      1|      1|2025|    1|   1|  14|
+-------+-------+----+-----+----+----+
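
As an extra check, the data can be read back from Hadoop and the partition columns inspected; the path and column names below reuse the placeholders from the sketch in the Issue section:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-check").getOrCreate()

# Read the partitioned dataset back; Spark rebuilds year/month/day/hour from the
# folder names, so the record should now sit under hour=14.
events = spark.read.parquet("hdfs:///warehouse/events")
(
    events.filter("year = 2025 AND month = 1 AND day = 1")
          .select("id", "user_id", "year", "month", "day", "hour")
          .show()
)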

KeepTheSimpleWays!
