Events Supported by Event Monitoring

Table 1 Elastic Cloud Server (ECS)

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

ECS

Reboot ECS

rebootServer

Minor

The ECS was rebooted

  • on the management console.

  • by calling APIs.

Check whether the reboot was performed intentionally by a user.

  • Deploy service applications in HA mode.

  • After the ECS starts up, check whether services recover.

Services are interrupted.

Start auto recovery

startAutoRecovery

Major

ECSs on a faulty host are automatically migrated to another properly running host. During the migration, the ECSs are restarted.

Wait for the event to end and check whether services are affected.

Services may be interrupted.

Stop auto recovery

endAutoRecovery

Major

The ECS was recovered after the automatic migration.

This event indicates that the ECS has recovered and is working properly.

None

Auto recovery timeout (being processed on the backend)

faultAutoRecovery

Major

Migrating the ECS to a normal host timed out.

Migrate services to other ECSs.

Services are interrupted.

Startup failure

faultPowerOn

Major

The ECS failed to start.

Start the ECS again. If the problem persists, contact O&M personnel.

The ECS cannot start.

GPU link fault

GPULinkFault

Critical

The GPU of the host on which the ECS is located was

  • faulty.

  • recovering from a fault.

Deploy service applications in HA mode.

After the GPU fault is rectified, check whether services are restored.

Services are interrupted.

FPGA link fault

FPGALinkFault

Critical

The FPGA of the host on which the ECS is located was

  • faulty.

  • recovering from a fault.

Deploy service applications in HA mode.

After the FPGA fault is rectified, check whether services are restored.

Services are interrupted.

Improper ECS running

vmIsRunningImproperly

Major

The ECS was faulty or the ECS NIC was abnormal.

Deploy service applications in HA mode.

After the fault is rectified, check whether services recover.

Services are interrupted.

Improper ECS running recovered

vmIsRunningImproperlyRecovery

Major

The ECS was restored to the normal status.

Wait for the ECS status to become normal and check whether services are affected.

None

Local disk failure

LocalDiskError

Major

Local disks used by the ECS were faulty.

Contact O&M personnel.

Local disks are unavailable.

VM faults caused by host process exceptions

VMFaultsByHostProcessExceptions

Critical

The processes of the host accommodating the ECS were abnormal.

Contact O&M personnel.

The ECS is faulty.

Note

Once a physical host running ECSs breaks down, the ECSs are automatically migrated to a functional physical host. During the migration, the ECSs will be restarted.
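The auto recovery events in Table 1 form a small lifecycle: startAutoRecovery opens a migration, and either endAutoRecovery or faultAutoRecovery closes it. The following is a minimal Python sketch of how a consumer of these events might track that lifecycle and pick out Critical events. The event IDs and severities are copied from Table 1. The function names, the state labels, and the idea that events arrive as a chronological list of IDs are assumptions made for illustration, not part of any Event Monitoring API.

```python
# Event IDs and severities copied from Table 1.
# Everything else in this sketch (names, state labels) is hypothetical.
ECS_EVENT_SEVERITY = {
    "rebootServer": "Minor",
    "startAutoRecovery": "Major",
    "endAutoRecovery": "Major",
    "faultAutoRecovery": "Major",
    "faultPowerOn": "Major",
    "GPULinkFault": "Critical",
    "FPGALinkFault": "Critical",
    "vmIsRunningImproperly": "Major",
    "vmIsRunningImproperlyRecovery": "Major",
    "LocalDiskError": "Major",
    "VMFaultsByHostProcessExceptions": "Critical",
}


def auto_recovery_state(event_ids):
    """Return the auto recovery state implied by a chronological list of event IDs."""
    state = "idle"
    for event_id in event_ids:
        if event_id == "startAutoRecovery":
            state = "migrating"   # ECS is being moved to another host
        elif event_id == "endAutoRecovery":
            state = "recovered"   # migration finished, ECS is working again
        elif event_id == "faultAutoRecovery":
            state = "timed_out"   # migration timed out, services are interrupted
    return state


def critical_events(event_ids):
    """Keep only the event IDs that Table 1 lists with Critical severity."""
    return [e for e in event_ids if ECS_EVENT_SEVERITY.get(e) == "Critical"]
```

With this sketch, a sequence that ends in faultAutoRecovery yields the "timed_out" state, which corresponds to the row whose solution is to migrate services to other ECSs.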

Table 2 Relational Database Service (RDS) — resource exception

Event Source

Event Name

Event ID

Event Severity

Description

Solution

Impact

RDS

DB instance creation failure

createInstanceFailed

Major

A DB instance fails to be created because the number of disks is insufficient, the quota is insufficient, or underlying resources are exhausted.

Check the number and quota of disks. Release resources and create DB instances again.

DB instances cannot be created.

Full backup failure

fullBackupFailed

Major

A single full backup failure does not affect the files that have been successfully backed up, but it prolongs the incremental backup time during a point-in-time restore (PITR).

Create a manual backup again.

Backup failed.

Primary/standby switchover or failover failure

activeStandBySwitchFailed

Major

The standby DB instance fails to take over workloads from the primary DB instance because of network or server failures. The original primary DB instance continues to handle workloads within a short time.

Check whether the connection between your application and the database is re-established.

None

Replication status abnormal

abnormalReplicationStatus

Major

The possible causes are as follows:

  • The replication delay between the primary instance and the standby instance or a read replica is too long, which usually occurs when a large amount of data is being written to databases or a large transaction is being processed. During peak hours, data may be blocked.

  • The network between the primary instance and the standby instance or a read replica is disconnected.

Submit a service ticket.

Your applications are not affected because this event does not interrupt data read and write.

Replication status recovered

replicationStatusRecovered

Major

The replication delay between the primary and standby instances is within the normal range, or the network connection between them has been restored.

No action is required.

None

DB instance faulty

faultyDBInstance

Major

A single or primary DB instance was faulty due to a disaster or a server failure.

Check whether an automated backup policy has been configured for the DB instance and submit a service ticket.

The database service may be unavailable.

DB instance recovered

DBInstanceRecovered

Major

RDS rebuilds the standby DB instance by using its high availability mechanism. This event is reported after the instance is rebuilt.

No action is required.

None

Failure of changing single DB instance to primary/standby

singleToHaFailed

Major

A fault occurs when RDS is creating the standby DB instance or configuring replication between the primary and standby DB instances. The fault may occur because resources are insufficient in the data center where the standby DB instance is located.

Submit a service ticket.

Your applications are not affected because this event does not interrupt data read and write of the DB instance.

Database process restarted

DatabaseProcessRestarted

Major

The database process is stopped due to insufficient memory or high load.

Log in to the Cloud Eye console. Check whether the memory usage increases sharply, the CPU usage is too high for a long time, or the storage space is insufficient. You can increase the CPU and memory specifications or optimize the service logic.

When the process exits abnormally, workloads are interrupted. In this case, RDS automatically restarts the database process and attempts to recover the workloads.

Instance storage full

instanceDiskFull

Major

Generally, the cause is that the data space usage is too high.

Scale up the instance.

The DB instance becomes read-only because the storage space is full, and data cannot be written to the database.

Instance storage full recovered

instanceDiskFullRecovered

Major

The storage space of the instance has recovered.

No action is required.

The instance is restored and supports both read and write operations.

Kafka connection failed

kafkaConnectionFailed

Major

The network is unstable or the Kafka server does not work properly.

Check your network connection and the Kafka server status.

Audit logs cannot be sent to the Kafka server.
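Several events in Table 2 come in fault and recovery pairs, for example instanceDiskFull and instanceDiskFullRecovered. The following Python sketch shows one way to work out which faults are still open from a chronological list of event IDs. The event IDs are taken from Table 2. The pairing itself is inferred from the event names and descriptions, and the function name and input format are hypothetical.

```python
# Recovery event ID -> fault event ID it clears.
# IDs come from Table 2; the pairing is an assumption inferred from the event names.
RDS_RECOVERY_FOR = {
    "replicationStatusRecovered": "abnormalReplicationStatus",
    "DBInstanceRecovered": "faultyDBInstance",
    "instanceDiskFullRecovered": "instanceDiskFull",
}
RDS_FAULTS = set(RDS_RECOVERY_FOR.values())


def open_faults(event_ids):
    """Return fault event IDs not yet followed by their recovery event, in first-seen order."""
    still_open = []
    for event_id in event_ids:
        if event_id in RDS_FAULTS:
            if event_id not in still_open:
                still_open.append(event_id)
        elif event_id in RDS_RECOVERY_FOR:
            fault = RDS_RECOVERY_FOR[event_id]
            if fault in still_open:
                still_open.remove(fault)
        # Events without a recovery counterpart (for example fullBackupFailed) are ignored here.
    return still_open
```

A fault that stays in the returned list is a candidate for the action in the Solution column, such as scaling up the instance for instanceDiskFull.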

Table 3 Relational Database Service (RDS) — operations

Event Source

Event Name

Event ID

Event Severity

Description

RDS

Reset administrator password

resetPassword

Major

The password of the database administrator is reset.

Operate DB instance

instanceAction

Major

The storage space is scaled or the instance class is changed.

Delete DB instance

deleteInstance

Minor

The DB instance is deleted.

Modify backup policy

setBackupPolicy

Minor

The backup policy is modified.

Modify parameter group

updateParameterGroup

Minor

The parameter group is modified.

Delete parameter group

deleteParameterGroup

Minor

The parameter group is deleted.

Reset parameter group

resetParameterGroup

Minor

The parameter group is reset.

Change database port

changeInstancePort

Major

The database port is changed.

Primary/standby switchover or failover

PrimaryStandbySwitched

Major

A switchover or failover is performed.
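When deciding which operation events deserve a notification, it can help to filter Table 3 by severity. The following Python sketch does this with a plain lookup table. The event IDs and severities are copied from Table 3. The ranking Minor < Major < Critical and the helper name are assumptions made for this example.

```python
# Event IDs and severities copied from Table 3.
RDS_OPERATION_SEVERITY = {
    "resetPassword": "Major",
    "instanceAction": "Major",
    "deleteInstance": "Minor",
    "setBackupPolicy": "Minor",
    "updateParameterGroup": "Minor",
    "deleteParameterGroup": "Minor",
    "resetParameterGroup": "Minor",
    "changeInstancePort": "Major",
    "PrimaryStandbySwitched": "Major",
}

# Assumed ordering of the severities that appear in these tables.
SEVERITY_RANK = {"Minor": 1, "Major": 2, "Critical": 3}


def events_at_least(severity, table=RDS_OPERATION_SEVERITY):
    """Return the set of event IDs whose severity is at or above the given level."""
    threshold = SEVERITY_RANK[severity]
    return {event_id for event_id, level in table.items()
            if SEVERITY_RANK[level] >= threshold}
```

For example, events_at_least("Major") selects the password reset, instance operation, port change, and switchover events, and leaves out the Minor policy and parameter group changes.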