Events Supported by Event Monitoring
Event Source | Event Name | Event ID | Event Severity | Description | Solution | Impact |
---|---|---|---|---|---|---|
ECS | Reboot ECS | rebootServer | Minor | The ECS was rebooted. | Check whether the reboot was performed intentionally by a user. | Services are interrupted. |
ECS | Start auto recovery | startAutoRecovery | Major | ECSs on a faulty host were automatically migrated to another properly running host. During the migration, the ECSs were restarted. | Wait for the event to end and check whether services are affected. | Services may be interrupted. |
ECS | Stop auto recovery | endAutoRecovery | Major | The ECS was recovered after the automatic migration. | This event indicates that the ECS has recovered and is working properly. | None |
ECS | Auto recovery timeout (being processed on the backend) | faultAutoRecovery | Major | Migrating the ECS to a normal host timed out. | Migrate services to other ECSs. | Services are interrupted. |
ECS | Startup failure | faultPowerOn | Major | The ECS failed to start. | Start the ECS again. If the problem persists, contact O&M personnel. | The ECS cannot start. |
ECS | GPU link fault | GPULinkFault | Critical | The GPU of the host on which the ECS is located was faulty. | Deploy service applications in HA mode. After the GPU fault is rectified, check whether services are restored. | Services are interrupted. |
ECS | FPGA link fault | FPGALinkFault | Critical | The FPGA of the host on which the ECS is located was faulty. | Deploy service applications in HA mode. After the FPGA fault is rectified, check whether services are restored. | Services are interrupted. |
ECS | Improper ECS running | vmIsRunningImproperly | Major | The ECS was faulty or the ECS NIC was abnormal. | Deploy service applications in HA mode. After the fault is rectified, check whether services are restored. | Services are interrupted. |
ECS | Improper ECS running recovered | vmIsRunningImproperlyRecovery | Major | The ECS was restored to the normal status. | Wait for the ECS status to become normal and check whether services are affected. | None |
ECS | Local disk failure | LocalDiskError | Major | Local disks used by the ECS were faulty. | Contact O&M personnel. | Local disks are unavailable. |
ECS | VM faults caused by host process exceptions | VMFaultsByHostProcessExceptions | Critical | The processes of the host accommodating the ECS were abnormal. | Contact O&M personnel. | The ECS is faulty. |
Note
Once a physical host running ECSs breaks down, the ECSs are automatically migrated to a functional physical host. During the migration, the ECSs will be restarted.
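
The auto recovery events in the ECS table above form a small lifecycle: `startAutoRecovery` opens a migration, and either `endAutoRecovery` or `faultAutoRecovery` closes it. The following is a minimal sketch of tracking that lifecycle per ECS. The event record shape (`resource_id`, `event_id`) and the state labels are assumptions made for illustration, not a Cloud Eye data format.

```python
from typing import Dict, Iterable

# Event IDs taken from the ECS table above.
START = "startAutoRecovery"
END = "endAutoRecovery"
TIMEOUT = "faultAutoRecovery"


def recovery_status(events: Iterable[dict]) -> Dict[str, str]:
    """Return the latest auto recovery state for each ECS.

    Each event is a hypothetical record {"resource_id": str, "event_id": str},
    listed oldest first. Unrelated event IDs are ignored.
    """
    state: Dict[str, str] = {}
    for ev in events:
        rid, eid = ev["resource_id"], ev["event_id"]
        if eid == START:
            state[rid] = "migrating"   # wait for the event to end
        elif eid == END:
            state[rid] = "recovered"   # ECS is working properly again
        elif eid == TIMEOUT:
            state[rid] = "timed_out"   # migrate services to other ECSs
    return state


sample = [
    {"resource_id": "ecs-a", "event_id": "startAutoRecovery"},
    {"resource_id": "ecs-b", "event_id": "startAutoRecovery"},
    {"resource_id": "ecs-a", "event_id": "endAutoRecovery"},
    {"resource_id": "ecs-b", "event_id": "faultAutoRecovery"},
]
print(recovery_status(sample))
# {'ecs-a': 'recovered', 'ecs-b': 'timed_out'}
```

An ECS that stays in the `migrating` state has started auto recovery without reporting an end or timeout event yet.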
Event Source | Event Name | Event ID | Event Severity | Description | Solution | Impact |
---|---|---|---|---|---|---|
RDS | DB instance creation failure | createInstanceFailed | Major | A DB instance failed to be created because the number of disks is insufficient, the quota is insufficient, or underlying resources are exhausted. | Check the number and quota of disks. Release resources and create DB instances again. | DB instances cannot be created. |
RDS | Full backup failure | fullBackupFailed | Major | A single full backup failure does not affect the files that have been successfully backed up, but it prolongs the incremental backup time during point-in-time restore (PITR). | Create a manual backup again. | Backup failed. |
RDS | Primary/standby switchover or failure | activeStandBySwitchFailed | Major | The standby DB instance does not take over workloads from the primary DB instance due to network or server failures. The original primary DB instance continues to provide workloads within a short time. | Check whether the connection between your application and the database is re-established. | None |
RDS | Replication status abnormal | abnormalReplicationStatus | Major | Possible causes: (1) The replication delay between the primary instance and the standby instance or a read replica is too long, which usually occurs when a large amount of data is being written to databases or a large transaction is being processed. During peak hours, data may be blocked. (2) The network between the primary instance and the standby instance or a read replica is disconnected. | Submit a service ticket. | Your applications are not affected because this event does not interrupt data read and write. |
RDS | Replication status recovered | replicationStatusRecovered | Major | The replication delay between the primary and standby instances is within the normal range, or the network connection between them has been restored. | No action is required. | None |
RDS | DB instance faulty | faultyDBInstance | Major | A single or primary DB instance was faulty due to a disaster or a server failure. | Check whether an automated backup policy has been configured for the DB instance and submit a service ticket. | The database service may be unavailable. |
RDS | DB instance recovered | DBInstanceRecovered | Major | RDS rebuilds the standby DB instance through its high availability capability. This event is reported after the instance is rebuilt. | No action is required. | None |
RDS | Failure of changing single DB instance to primary/standby | singleToHaFailed | Major | A fault occurs when RDS is creating the standby DB instance or configuring replication between the primary and standby DB instances. The fault may occur because resources are insufficient in the data center where the standby DB instance is located. | Submit a service ticket. | Your applications are not affected because this event does not interrupt data read and write of the DB instance. |
RDS | Database process restarted | DatabaseProcessRestarted | Major | The database process is stopped due to insufficient memory or high load. | Log in to the Cloud Eye console and check whether the memory usage has increased sharply, the CPU usage has been too high for a long time, or the storage space is insufficient. You can increase the CPU and memory specifications or optimize the service logic. | When the process exits abnormally, workloads are interrupted. In this case, RDS automatically restarts the database process and attempts to recover the workloads. |
RDS | Instance storage full | instanceDiskFull | Major | Generally, the cause is that the data space usage is too high. | Scale up the instance. | The DB instance becomes read-only because the storage space is full, and data cannot be written to the database. |
RDS | Instance storage full recovered | instanceDiskFullRecovered | Major | The instance disk is recovered. | No action is required. | The instance is restored and supports both read and write operations. |
RDS | Kafka connection failed | kafkaConnectionFailed | Major | The network is unstable or the Kafka server does not work properly. | Check your network connection and the Kafka server status. | Audit logs cannot be sent to the Kafka server. |
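
Several RDS events in the table above come in fault and recovery pairs. The sketch below shows one way to find faults that have not yet been followed by their recovery event. The pairing is inferred from the event names in the table and is an assumption, as is the input format (a plain list of event IDs for one DB instance, oldest first).

```python
# Fault event ID -> recovery event ID (pairing inferred from the event names).
RECOVERY_OF = {
    "abnormalReplicationStatus": "replicationStatusRecovered",
    "faultyDBInstance": "DBInstanceRecovered",
    "instanceDiskFull": "instanceDiskFullRecovered",
}
# Reverse lookup: recovery event ID -> fault event ID.
FAULT_OF = {recovery: fault for fault, recovery in RECOVERY_OF.items()}


def open_faults(event_ids):
    """Return the sorted fault event IDs with no later recovery event."""
    unresolved = set()
    for event_id in event_ids:
        if event_id in RECOVERY_OF:
            unresolved.add(event_id)                # a fault was reported
        elif event_id in FAULT_OF:
            unresolved.discard(FAULT_OF[event_id])  # its recovery arrived
    return sorted(unresolved)


history = [
    "instanceDiskFull",
    "abnormalReplicationStatus",
    "instanceDiskFullRecovered",
    "fullBackupFailed",  # has no recovery event in the table, so it is ignored
]
print(open_faults(history))
# ['abnormalReplicationStatus']
```

Events without a matching recovery event, such as `fullBackupFailed` or `kafkaConnectionFailed`, need their own handling based on the Solution column.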
Event Source | Event Name | Event ID | Event Severity | Description |
---|---|---|---|---|
RDS | Reset administrator password | resetPassword | Major | The password of the database administrator is reset. |
RDS | Operate DB instance | instanceAction | Major | The storage space is scaled or the instance class is changed. |
RDS | Delete DB instance | deleteInstance | Minor | The DB instance is deleted. |
RDS | Modify backup policy | setBackupPolicy | Minor | The backup policy is modified. |
RDS | Modify parameter group | updateParameterGroup | Minor | The parameter group is modified. |
RDS | Delete parameter group | deleteParameterGroup | Minor | The parameter group is deleted. |
RDS | Reset parameter group | resetParameterGroup | Minor | The parameter group is reset. |
RDS | Change database port | changeInstancePort | Major | The database port is changed. |
RDS | Primary/standby switchover or failover | PrimaryStandbySwitched | Major | A switchover or failover is performed. |
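
The operation events above can be filtered by severity, for example to review only Major changes. This is a minimal sketch: the severity values are copied from the table, while the function name and the input format (a list of event IDs) are illustrative assumptions.

```python
# Event ID -> severity, copied from the RDS operation table above.
OPERATION_SEVERITY = {
    "resetPassword": "Major",
    "instanceAction": "Major",
    "deleteInstance": "Minor",
    "setBackupPolicy": "Minor",
    "updateParameterGroup": "Minor",
    "deleteParameterGroup": "Minor",
    "resetParameterGroup": "Minor",
    "changeInstancePort": "Major",
    "PrimaryStandbySwitched": "Major",
}


def major_operations(event_ids):
    """Keep only the event IDs listed as Major, preserving input order."""
    return [e for e in event_ids if OPERATION_SEVERITY.get(e) == "Major"]


print(major_operations(["deleteInstance", "changeInstancePort", "resetPassword"]))
# ['changeInstancePort', 'resetPassword']
```

The lookup is an exact, case-sensitive match. The table mixes capitalization styles (`resetPassword` but `PrimaryStandbySwitched`), so event IDs should be compared as written.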