Service Knowledgebase


Engineering Updates

Date

22 August 2022

Current Status

Resolved

Ticket

SVCENG-606

Title

Rosalia website cannot be opened

User Impact

The website took more than 1 minute to load and was unusable for customers.

Root Cause

The site was briefly inaccessible because the kernel side added a table to odoo_conf as part of the kernel V2 pipeline. A process stalled while the kernel was updating the table, so the web service had to be restarted.

Fix Performed

Restarted the web service.

Future Prevention

None. This is an inherent risk of active development.

We already have an automated run in place that checks server speed every 3 hours.
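
A minimal sketch of what such a speed check could look like, assuming a plain HTTP fetch is a fair proxy for page load time. The function names, the 60-second threshold (taken from the "more than 1 minute" impact above), and the injectable `opener` parameter are illustrative assumptions, not the team's actual automation.

```python
import time
import urllib.request


def measure_load_seconds(url, timeout_s=90.0, opener=urllib.request.urlopen):
    """Return the seconds taken to fetch url, or None if it fails or times out."""
    start = time.monotonic()
    try:
        # opener is injectable so the check can be exercised without a network.
        with opener(url, timeout=timeout_s) as response:
            response.read()
    except OSError:  # URLError and socket timeouts are both OSError subclasses
        return None
    return time.monotonic() - start


def is_too_slow(elapsed, threshold_s=60.0):
    """Treat a failed fetch (None) or a slow fetch as an alert condition."""
    return elapsed is None or elapsed > threshold_s
```

A scheduler (cron, for example) would call `measure_load_seconds` every 3 hours and raise an alert when `is_too_slow` returns True.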

Date

9 April 2022

Current Status

Resolved

Ticket

SVCENG-282

Title

GT06 Device Offline

User Impact

GT06 devices were offline.

Root Cause

We restart the service on a daily schedule because our kernel is unstable.

This morning's restart did not finish within the allotted time (6 minutes).

Fix Performed

Started the service manually.

Future Prevention

Increase the service restart timeout from 6 to 10 minutes, which is definitely safe.
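
As an illustration of a restart guarded by a timeout, here is a small sketch. It assumes the restart is a single command run by a wrapper script; `restart_within` and `RESTART_TIMEOUT_S` are hypothetical names, and the real restart mechanism is not described in this entry.

```python
import subprocess

RESTART_TIMEOUT_S = 10 * 60  # 10 minutes, raised from the previous 6 minutes


def restart_within(command, timeout_s=RESTART_TIMEOUT_S):
    """Run a restart command; return True only if it exits 0 within timeout_s."""
    try:
        # subprocess.run kills the child and raises if the timeout elapses.
        result = subprocess.run(command, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

A False return is the case that needed a manual start in this incident, so it is a natural place to page someone or retry.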

Date

26 March 2022

Current Status

Resolved

Ticket

 

Title

System Offline for 2 Hours, 2AM-4AM Monday 28 Mar 2022.

User Impact

Total outage

Root Cause

The server needs to be upgraded to handle DisHub requests.

Fix Performed

-

Future Prevention

-

Date

23 March 2022

Current Status

Resolved

Severity

High

Ticket

SVCENG-242 SVCENG-244

Title

Many Units Offline

User Impact

Around 30-50% of the fleet was offline, and users could not make full use of the system.

Root Cause

Code update by the Kernel Team to address the 5-minute data lag.

The change was tested and worked in the QC environment but not in Production. When the backlog was processed, the many backlogged messages were handled first and new messages were queued behind them, so devices appeared to be offline.
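
The queueing effect described above can be approximated with simple arithmetic: in a first-in, first-out queue, a new message waits roughly backlog size divided by processing rate. The sketch below assumes a constant processing rate; the function names and the sample numbers are hypothetical, and only the 10-minute offline threshold comes from this incident.

```python
def queue_delay_seconds(backlog_messages, messages_per_second):
    """Seconds a newly arrived message waits behind a FIFO backlog."""
    return backlog_messages / messages_per_second


def appears_offline(backlog_messages, messages_per_second, offline_timeout_s=600):
    """True if the wait alone exceeds the offline threshold (10 minutes here)."""
    return queue_delay_seconds(backlog_messages, messages_per_second) > offline_timeout_s


# Hypothetical numbers: 90,000 queued messages processed at 100 per second
# delay fresh data by 900 seconds, which is longer than the 600-second threshold.
print(appears_offline(90_000, 100))
```

With a small device count, as in QC, the backlog drains quickly and the delay never crosses the threshold, which is consistent with the issue only showing up in Production.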

Fix Performed

  • Reverted the code to the previous version

  • Raised the unit-offline timeout from 10 minutes to 60 minutes, so that a unit with poor signal is not automatically marked offline
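
The offline rule in the second item can be sketched as a comparison between a unit's last report time and a configurable timeout. This is an assumption about how the status is derived; `is_offline` and `OFFLINE_TIMEOUT` are illustrative names, not identifiers from the production code.

```python
from datetime import datetime, timedelta

OFFLINE_TIMEOUT = timedelta(minutes=60)  # raised from 10 minutes in this fix


def is_offline(last_seen, now, timeout=OFFLINE_TIMEOUT):
    """A unit is offline when its last report is older than the timeout."""
    return now - last_seen > timeout


now = datetime(2022, 3, 23, 12, 0)
last_seen = now - timedelta(minutes=25)  # e.g. a unit passing through poor signal
print(is_offline(last_seen, now))                                 # 60-minute rule
print(is_offline(last_seen, now, timeout=timedelta(minutes=10)))  # old 10-minute rule
```

The trade-off of the longer timeout is that a unit that has genuinely stopped reporting is flagged up to 50 minutes later than before.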

Future Prevention

We are looking for a way to simulate 5000 devices in our QC environment, because issues often cannot be replicated in QC, where the number of devices is small.

 

Date

10 March 2022

Current Status

Resolved

Severity

High

Ticket

ENG-1317 (Resolved)

Title

DB Writing Bottleneck

User Impact

Live view data lags by 0 to 5 minutes.

If this continues, a system crash or data loss is possible.

Root Cause

On 18 Feb 2022, live view was updated to accommodate an NCR request (on behalf of Rosalia), resulting in a 40% increase in overall CPU usage in our infrastructure.

Fix Performed

  • Separated the NCR request into a different page outside live view, and made that custom page available only to them

  • Found other queries, not related to NCR, that were causing errors in our database, and fixed those too

Future Prevention

We will need to discuss with the Founder how we are going to operate.

We cannot continue to accommodate requests without paying a price in cost and performance.

 

Date

9 March 2022

Current Status

Resolved

Severity

High

Ticket

SERVICE-206 (Resolved)

Title

GT06N Devices offline for 12 hours

User Impact

Loss of data between 9 March 2022 18:00 - 10 March 2022 06:00, for GT06N devices

Root Cause

An engineer made a code change that resulted in an uncaught error.

Fix Performed

Code was fixed by the Engineer

Future Prevention

  • Education provided to the engineer to ensure code is tested before it is released to production

  • Create an automation that checks for this kind of issue hourly, so that it is caught early: ENG-1512

 

 

Date

07 April 2022
