RAS information

AMD SMI: RAS information
Functions

amdsmi_status_t amdsmi_get_bad_page_threshold (amdsmi_processor_handle processor_handle, uint32_t *threshold)
 Get the bad page threshold for a device.
 
amdsmi_status_t amdsmi_get_gpu_cper_entries (amdsmi_processor_handle processor_handle, uint32_t severity_mask, char *cper_data, uint64_t *buf_size, amdsmi_cper_hdr_t **cper_hdrs, uint64_t *entry_count, uint64_t *cursor)
 Retrieve CPER entries cached in the driver.
 
amdsmi_status_t amdsmi_get_afids_from_cper (char *cper_buffer, uint32_t buf_size, uint64_t *afids, uint32_t *num_afids)
 Get the AFIDs from CPER buffer.
 
amdsmi_status_t amdsmi_get_gpu_ras_feature_info (amdsmi_processor_handle processor_handle, amdsmi_ras_feature_t *ras_feature)
 Returns RAS features info.
 
amdsmi_status_t amdsmi_get_gpu_ras_policy_info (amdsmi_processor_handle processor_handle, amdsmi_gpu_ras_policy_info_t *info)
 Get the RAS policy info for a device.
 
amdsmi_status_t amdsmi_get_gpu_bad_page_info (amdsmi_processor_handle processor_handle, uint32_t *bad_page_size, amdsmi_eeprom_table_record_t *bad_pages)
 Returns the bad page info.
 

Function Documentation

◆ amdsmi_get_bad_page_threshold()

amdsmi_status_t amdsmi_get_bad_page_threshold ( amdsmi_processor_handle  processor_handle,
uint32_t *  threshold 
)

Get the bad page threshold for a device.

Platform:

gpu_bm_linux

host

Given a processor handle processor_handle and a pointer to a uint32_t threshold, this function retrieves the bad page threshold of the device identified by processor_handle and stores it at the location pointed to by threshold.

Note
This function requires admin/sudo privileges on the gpu_bm_linux platform.
Parameters
[in] processor_handle: A processor handle.
[in,out] threshold: Pointer to the location where the bad page threshold value will be written. If this parameter is nullptr, the function returns AMDSMI_STATUS_INVAL if it is supported with the provided arguments, and AMDSMI_STATUS_NOT_SUPPORTED if it is not.
Returns
amdsmi_status_t | AMDSMI_STATUS_SUCCESS on success, non-zero on fail
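
The calling pattern described above can be sketched as follows. This is a minimal sketch, not the library's own code: the typedefs, the numeric values of the non-success status codes, and the mock amdsmi_get_bad_page_threshold() (which always reports a made-up threshold of 128) are stand-ins so the fragment compiles without the AMD SMI headers, and read_bad_page_threshold() is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins so the sketch compiles without the AMD SMI headers.
 * With the real library, include its header and delete this section.
 * Only AMDSMI_STATUS_SUCCESS == 0 follows from the text ("non-zero on
 * fail"); the other numeric values are placeholders. */
typedef void *amdsmi_processor_handle;
typedef enum {
    AMDSMI_STATUS_SUCCESS = 0,
    AMDSMI_STATUS_INVAL = 1,
    AMDSMI_STATUS_NOT_SUPPORTED = 2
} amdsmi_status_t;

/* Mock with the documented signature. It rejects a null output pointer
 * and otherwise reports a made-up threshold of 128. */
static amdsmi_status_t amdsmi_get_bad_page_threshold(
    amdsmi_processor_handle processor_handle, uint32_t *threshold)
{
    (void)processor_handle;
    if (threshold == NULL)
        return AMDSMI_STATUS_INVAL;
    *threshold = 128u;
    return AMDSMI_STATUS_SUCCESS;
}

/* Hypothetical helper: returns 1 and stores the threshold in *out on
 * success; returns 0 and logs the status code on failure. */
static int read_bad_page_threshold(amdsmi_processor_handle handle,
                                   uint32_t *out)
{
    assert(out != NULL);
    amdsmi_status_t st = amdsmi_get_bad_page_threshold(handle, out);
    if (st != AMDSMI_STATUS_SUCCESS) {
        fprintf(stderr, "amdsmi_get_bad_page_threshold failed: %d\n", (int)st);
        return 0;
    }
    return 1;
}
```

Because the real call requires admin/sudo privileges on gpu_bm_linux, a caller would normally treat any non-zero status as a possible permission problem as well as a missing feature.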

◆ amdsmi_get_gpu_cper_entries()

amdsmi_status_t amdsmi_get_gpu_cper_entries ( amdsmi_processor_handle  processor_handle,
uint32_t  severity_mask,
char *  cper_data,
uint64_t *  buf_size,
amdsmi_cper_hdr_t **  cper_hdrs,
uint64_t *  entry_count,
uint64_t *  cursor 
)

Retrieve CPER entries cached in the driver.

Platform:

gpu_bm_linux

host

guest_1vf

The caller passes buffers to hold the CPER data and the CPER headers. The library fills the data buffer with the entries that match the severity_mask passed by the caller. It also parses each CPER header and stores a pointer to it in the cper_hdrs array, which the caller can use to read the timestamp and other header information. A cursor is also returned, which can be used to fetch the next set of CPER entries.

If there is more data than either buffer can hold, the library returns AMDSMI_STATUS_MORE_DATA. The caller can call the API again with the cursor returned by the previous call to get more data. If the buffer is too small to hold even one entry, the library returns AMDSMI_STATUS_OUT_OF_RESOURCES.

Even after a call returns AMDSMI_STATUS_MORE_DATA, the next call may still return entry_count == 0, because the remaining entries in the driver cache may not have the severity the caller is interested in. The API is expected to return AMDSMI_STATUS_SUCCESS in this case, so the caller can ignore that call.

Parameters
[in] processor_handle: Handle to the processor for which CPER entries are to be retrieved.
[in] severity_mask: The severity mask of the entries to be retrieved.
[in,out] cper_data: Pointer to a buffer where the CPER data will be stored. The caller must allocate the buffer and set buf_size correctly.
[in,out] buf_size: Pointer to a variable that specifies the size of cper_data. On return, it contains the actual size of the data written to cper_data.
[in,out] cper_hdrs: Array of pointers to the parsed headers in cper_data. The caller must allocate the array of pointers. The library fills the array with pointers to the parsed headers; the underlying data stays in the cper_data buffer and only the pointers are stored in this array.
[in,out] entry_count: Pointer to a variable that specifies the length of the cper_hdrs array allocated by the caller. On return, it contains the number of entries actually written to cper_hdrs.
[in,out] cursor: Pointer to a variable that will contain the cursor for the next call.
Returns
amdsmi_status_t | AMDSMI_STATUS_SUCCESS on success, non-zero on fail
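
The cursor-based retrieval described above can be sketched as a loop that repeats while the call returns AMDSMI_STATUS_MORE_DATA. This is a minimal sketch under stated assumptions: the typedefs, the status code values other than 0, the one-byte amdsmi_cper_hdr_t layout, and the mock amdsmi_get_gpu_cper_entries() with its five made-up cache entries are stand-ins so the fragment compiles without the AMD SMI headers. It also assumes that bit i of severity_mask selects severity i and that a cursor of 0 starts from the first cached entry; neither is stated in the text. count_cper_entries() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins so the sketch compiles without the AMD SMI headers. */
typedef void *amdsmi_processor_handle;
typedef enum {
    AMDSMI_STATUS_SUCCESS = 0,          /* "non-zero on fail" implies 0 */
    AMDSMI_STATUS_OUT_OF_RESOURCES = 1, /* placeholder value */
    AMDSMI_STATUS_MORE_DATA = 2         /* placeholder value */
} amdsmi_status_t;
typedef struct { unsigned char severity; } amdsmi_cper_hdr_t; /* placeholder layout */

/* Mock driver cache: five fixed-size records with made-up severities. */
#define MOCK_ENTRIES 5u
static const uint32_t mock_sev[MOCK_ENTRIES] = {0, 1, 0, 2, 0};

static amdsmi_status_t amdsmi_get_gpu_cper_entries(
    amdsmi_processor_handle h, uint32_t severity_mask, char *cper_data,
    uint64_t *buf_size, amdsmi_cper_hdr_t **cper_hdrs,
    uint64_t *entry_count, uint64_t *cursor)
{
    (void)h;
    const uint64_t rec = sizeof(amdsmi_cper_hdr_t);
    uint64_t cap_bytes = *buf_size, cap_hdrs = *entry_count, used = 0, n = 0;
    if (cap_bytes < rec || cap_hdrs == 0)
        return AMDSMI_STATUS_OUT_OF_RESOURCES; /* cannot hold one entry */
    while (*cursor < MOCK_ENTRIES) {
        uint32_t sev = mock_sev[*cursor];
        if (severity_mask & (1u << sev)) {     /* assumed: bit i selects severity i */
            if (used + rec > cap_bytes || n == cap_hdrs)
                break;                         /* buffers full, resume here later */
            amdsmi_cper_hdr_t hdr = { (unsigned char)sev };
            memcpy(cper_data + used, &hdr, rec);
            cper_hdrs[n++] = (amdsmi_cper_hdr_t *)(cper_data + used);
            used += rec;
        }
        (*cursor)++;
    }
    *buf_size = used;    /* bytes actually written */
    *entry_count = n;    /* headers actually written */
    return *cursor < MOCK_ENTRIES ? AMDSMI_STATUS_MORE_DATA
                                  : AMDSMI_STATUS_SUCCESS;
}

/* Hypothetical helper: counts all cached entries matching severity_mask,
 * fetching at most two entries per call. Returns -1 on error. */
static int count_cper_entries(amdsmi_processor_handle h, uint32_t severity_mask)
{
    char data[2 * sizeof(amdsmi_cper_hdr_t)];
    amdsmi_cper_hdr_t *hdrs[2];
    uint64_t cursor = 0; /* assumed: 0 starts from the first cached entry */
    int total = 0;
    amdsmi_status_t st;
    do {
        uint64_t buf_size = sizeof(data); /* reset capacities before every call */
        uint64_t entry_count = 2;
        st = amdsmi_get_gpu_cper_entries(h, severity_mask, data, &buf_size,
                                         hdrs, &entry_count, &cursor);
        if (st != AMDSMI_STATUS_SUCCESS && st != AMDSMI_STATUS_MORE_DATA)
            return -1;
        total += (int)entry_count;        /* may legitimately be 0 */
    } while (st == AMDSMI_STATUS_MORE_DATA);
    return total;
}
```

Note that buf_size and entry_count are in/out parameters, so the loop resets both to the buffer capacities before each call; reusing the values written by the previous call would shrink the buffers on every iteration.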

◆ amdsmi_get_afids_from_cper()

amdsmi_status_t amdsmi_get_afids_from_cper ( char *  cper_buffer,
uint32_t  buf_size,
uint64_t *  afids,
uint32_t *  num_afids 
)

Get the AFIDs from CPER buffer.

Platform:

gpu_bm_linux

host

guest_1vf

guest_mvf

A utility function which retrieves the AFIDs from the CPER record.

Parameters
[in] cper_buffer: Pointer to the buffer holding one CPER record. The caller must make sure the whole CPER record is loaded into the buffer.
[in] buf_size: Size of cper_buffer.
[out] afids: Pointer to an array of uint64_t to which the AFIDs will be written.
[in,out] num_afids: As input, the number of uint64_t values that may be safely written to the memory pointed to by afids. This is the limit on how many AFIDs will be written to afids. On return, num_afids contains the number of AFIDs written to afids, or the number of AFIDs that could have been written if enough memory had been provided. It is suggested to pass MAX_NUMBER_OF_AFIDS_PER_RECORD to receive all AFIDs.
Returns
amdsmi_status_t | AMDSMI_STATUS_SUCCESS on success, non-zero on fail
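
The in/out use of num_afids can be sketched as follows. This is a minimal sketch under stated assumptions: the status enum, the value chosen for MAX_NUMBER_OF_AFIDS_PER_RECORD, and the mock amdsmi_get_afids_from_cper() (which ignores the record contents and reports three made-up AFIDs) are stand-ins so the fragment compiles without the AMD SMI headers. collect_afids() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins so the sketch compiles without the AMD SMI headers. */
typedef enum {
    AMDSMI_STATUS_SUCCESS = 0, /* "non-zero on fail" implies 0 */
    AMDSMI_STATUS_INVAL = 1    /* placeholder value */
} amdsmi_status_t;
#define MAX_NUMBER_OF_AFIDS_PER_RECORD 12 /* placeholder; use the header's value */

/* Mock with the documented signature: ignores the record contents and
 * reports three made-up AFIDs, never writing more than *num_afids. */
static amdsmi_status_t amdsmi_get_afids_from_cper(
    char *cper_buffer, uint32_t buf_size, uint64_t *afids, uint32_t *num_afids)
{
    static const uint64_t fake[3] = {11, 22, 33};
    if (!cper_buffer || buf_size == 0 || !afids || !num_afids)
        return AMDSMI_STATUS_INVAL;
    uint32_t n = *num_afids < 3u ? *num_afids : 3u;
    for (uint32_t i = 0; i < n; i++)
        afids[i] = fake[i];
    *num_afids = n;
    return AMDSMI_STATUS_SUCCESS;
}

/* Hypothetical helper: out must hold MAX_NUMBER_OF_AFIDS_PER_RECORD values.
 * Returns the number of AFIDs stored in out, or -1 on failure. */
static int collect_afids(char *record, uint32_t record_size, uint64_t *out)
{
    assert(out != NULL);
    uint32_t n = MAX_NUMBER_OF_AFIDS_PER_RECORD; /* in: capacity of out */
    if (amdsmi_get_afids_from_cper(record, record_size, out, &n)
            != AMDSMI_STATUS_SUCCESS)
        return -1;
    return (int)n;                               /* out: number reported */
}
```

Sizing the array with MAX_NUMBER_OF_AFIDS_PER_RECORD, as the parameter description suggests, avoids having to decide how a truncated result should be handled.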

◆ amdsmi_get_gpu_ras_feature_info()

amdsmi_status_t amdsmi_get_gpu_ras_feature_info ( amdsmi_processor_handle  processor_handle,
amdsmi_ras_feature_t *  ras_feature 
)

Returns RAS features info.

Platform:

gpu_bm_linux

host

guest_windows

Parameters
[in] processor_handle: Handle of the device to query.
[out] ras_feature: RAS features that are currently enabled and supported on the processor. Must be allocated by the caller.
Returns
amdsmi_status_t | AMDSMI_STATUS_SUCCESS on success, non-zero on fail

◆ amdsmi_get_gpu_ras_policy_info()

amdsmi_status_t amdsmi_get_gpu_ras_policy_info ( amdsmi_processor_handle  processor_handle,
amdsmi_gpu_ras_policy_info_t *  info 
)

Get the RAS policy info for a device.

Platform:

gpu_bm_linux

host

Given a processor handle processor_handle, this function will retrieve the RAS policy information for the device.

Parameters
[in] processor_handle: PF of the processor to query.
[out] info: RAS policy info for the device. Must be allocated by the caller.
Returns
amdsmi_status_t | AMDSMI_STATUS_SUCCESS on success, non-zero on fail

◆ amdsmi_get_gpu_bad_page_info()

amdsmi_status_t amdsmi_get_gpu_bad_page_info ( amdsmi_processor_handle  processor_handle,
uint32_t *  bad_page_size,
amdsmi_eeprom_table_record_t *  bad_pages 
)

Returns the bad page info.

Platform:
host
Parameters
[in] processor_handle: PF of the processor to query.
[in,out] bad_page_size: As input, the length of the array where the library will copy the data. On return, the number of bad page records.
[out] bad_pages: Pointer to the array of EEPROM table record structures. Must be allocated by the caller.
Returns
amdsmi_status_t | AMDSMI_STATUS_SUCCESS on success, non-zero on fail