================= Troubleshooting ================= Slow/stuck operations ===================== Sometimes CephFS operations hang. The first step in troubleshooting them is to locate the problem causing the operations to hang. Problems present in three places: #. in the client #. in the MDS #. in the network that connects the client to the MDS First, use the procedure in :ref:`slow_requests` to determine if the client has stuck operations or the MDS has stuck operations. Dump the MDS cache. The contents of the MDS cache will be used to diagnose the nature of the problem. Run the following command to dump the MDS cache: .. prompt:: bash # ceph daemon mds. dump cache /tmp/dump.txt .. note:: MDS services that are not controlled by systemd dump the file ``dump.txt`` to the machine that runs the MDS. MDS services that are controlled by systemd dump the file ``dump.txt`` to a tmpfs in the MDS container. Use `nsenter(1)` to locate ``dump.txt`` or specify another system-wide path. If high logging levels have been set on the MDS, ``dump.txt`` can be expected to hold the information needed to diagnose and solve the issue causing the CephFS operations to hang. .. _slow_requests: Slow requests (MDS) ------------------- List current operations via the admin socket by running the following command from the MDS host: .. prompt:: bash # ceph daemon mds. dump_ops_in_flight Identify the stuck commands and examine why they are stuck. Usually the last "event" will have been an attempt to gather locks, or sending the operation off to the MDS log. If it is waiting on the OSDs, fix them. If operations are stuck on a specific inode, then a client is likely holding capabilities, preventing its use by other clients. This situation can be caused by a client trying to flush dirty data, but it might be caused because you have encountered a bug in the distributed file lock code (the file "capabilities" ["caps"] system) of CephFS. If you have determined that the commands are stuck because of a bug in the capabilities code, restart the MDS. Restarting the MDS is likely to resolve the problem. If there are no slow requests reported on the MDS, and there is no indication that clients are misbehaving, then either there is a problem with the client or the client's requests are not reaching the MDS. .. _cephfs_dr_stuck_during_recovery: Stuck during recovery ===================== Stuck in up:replay ------------------ If your MDS is stuck in the ``up:replay`` state, then the journal is probably very long. The presence of ``MDS_HEALTH_TRIM`` cluster warnings can indicate that the MDS has not yet caught up while trimming its journal. Very large journals can take hours to process. There is no working around this, but there are things you can do to speed up the process: Temporarily disable MDS debug logs by reducing MDS debugging to ``0``. Even with the default settings, the MDS logs a few messages to memory for dumping in case a fatal error is encountered. You can turn off all logging by running the following commands: .. prompt:: bash # ceph config set mds debug_mds 0 ceph config set mds debug_ms 0 ceph config set mds debug_monc 0 Remember that when you set ``debug_mds``, ``debug_ms``, and ``debug_monc`` to ``0``, if the MDS fails then there will be no debugging information that can be used to determine why fatal errors occurred. If you can calculate when ``up:replay`` will complete, restore these configurations just prior to entering the next state: .. code:: bash ceph config rm mds debug_mds ceph config rm mds debug_ms ceph config rm mds debug_monc After replay has been expedited, calculate when the MDS will complete the replay. Examine the journal replay status: .. code:: bash $ ceph tell mds.:0 status | jq .replay_status { "journal_read_pos": 4195244, "journal_write_pos": 4195244, "journal_expire_pos": 4194304, "num_events": 2, "num_segments": 2 } Replay completes when the ``journal_read_pos`` reaches the ``journal_write_pos``. The write position does not change during replay. Track the progression of the read position to compute the expected time to complete. The MDS emits an `MDS_ESTIMATED_REPLAY_TIME` warning when the act of replaying the journal takes more than 30 seconds. The warning message includes an estimated time to the completion of journal replay:: mds.a(mds.0): replay: 50.0446% complete - elapsed time: 582s, estimated time remaining: 581s .. _cephfs_troubleshooting_avoiding_recovery_roadblocks: Avoiding recovery roadblocks ---------------------------- Do the following when restoring your file system: * **Deny all reconnection to clients.** Blocklist all existing CephFS sessions, causing all mounts to hang or become unavailable: .. prompt:: bash # ceph config set mds mds_deny_all_reconnect true Remember to undo this after the MDS becomes active. .. note:: This does not prevent new sessions from connecting. Use the ``refuse_client_session`` file-system setting to prevent new sessions from connecting to the CephFS. * **Extend the MDS heartbeat grace period.** Doing this causes the system to avoid replacing an MDS that becomes "stuck" during an operation. Sometimes recovery of an MDS may involve operations that take longer than expected (from the programmer's perspective). This is more likely when recovery has already taken longer than normal to complete (which, if you're reading this document, is likely the situation you find yourself in). Avoid unnecessary replacement loops by running the following command and extending the heartbeat grace period: .. prompt:: bash # ceph config set mds mds_heartbeat_grace 3600 .. note:: This causes the MDS to continue to send beacons to the monitors even when its internal "heartbeat" mechanism has not been reset (it has not beaten) in one hour. In the past, this was achieved with the ``mds_beacon_grace`` monitor setting. * **Disable open-file-table prefetch.** Under normal circumstances, the MDS prefetches directory contents during recovery as a way of heating up its cache. During a long recovery, the cache is probably already hot **and large**. If the cache is already hot and large, this prefetching is unnecessary and can be undesirable. Disable open-file-table prefetching by running the following command: .. prompt:: bash # ceph config set mds mds_oft_prefetch_dirfrags false * **Turn off clients.** Clients that reconnect to the newly ``up:active`` MDS can create new load on the file system just as it is becoming operational. This is often undesirable. Maintenance is often necessary before allowing clients to connect to the file system and before resuming a regular workload. For example, expediting the trimming of journals may be advisable if the recovery took a long time due to the amount of time replay spent in reading a very large journal. Client sessions can be refused manually, or by using the ``refuse_client_session`` tunable as in the following command: .. prompt:: bash # ceph fs set refuse_client_session true This command has the effect of preventing clients from establishing new sessions with the MDS. * **Do not tweak max_mds.** Modifying the file-system setting variable ``max_mds`` may seem like a good idea during troubleshooting and recovery, but it probably isn't. Modifying ``max_mds`` might have the effect of further destabilizing the cluster. If ``max_mds`` must be changed in such circumstances, run the command to change ``max_mds`` with the confirmation flag (``--yes-i-really-mean-it``). .. _pause-purge-threads: * **Turn off async purge threads.** The volumes plugin spawns threads that asynchronously purge trashed or deleted subvolumes. During troubleshooting or recovery, these purge threads can be disabled by running the following command: .. prompt:: bash # ceph config set mgr mgr/volumes/pause_purging true To resume purging, run the following command: .. prompt:: bash # ceph config set mgr mgr/volumes/pause_purging false .. _pause-clone-threads: * **Turn off async cloner threads.** The volumes plugin spawns threads that asynchronously clone subvolume snapshots. During troubleshooting or recovery, these cloner threads can be disabled by running the following command: .. prompt:: bash # ceph config set mgr mgr/volumes/pause_cloning true To resume cloning, run the following command: .. prompt:: bash # ceph config set mgr mgr/volumes/pause_cloning false Expediting MDS journal trim =========================== ``MDS_HEALTH_TRIM`` warnings indicate that the MDS journal has grown too large. When the MDS journal has grown too large, use the ``mds_tick_interval`` tunable to modify the "MDS tick interval". The "tick" interval drives various upkeep activities in the MDS, and modifying the interval will decrease the size of the MDS journal by ensuring that it is trimmed more frequently. Make sure that there is no significant file-system load present when modifying ``mds_tick_interval``. See :ref:`cephfs_troubleshooting_avoiding_recovery_roadblocks` for ways to reduce load on the CephFS. This setting affects only MDSes in the ``up:active`` state. The MDS does not trim its journal during recovery. Run the following command to modify the ``mds_tick_interval`` tunable: .. prompt:: bash # ceph config set mds mds_tick_interval 2 RADOS Health ============ If part of the CephFS metadata or data pools is unavailable and CephFS is not responding, it could indicate that RADOS itself is unhealthy. Resolve problems with RADOS before attempting to locate any problems in CephFS. See the :ref:`RADOS troubleshooting documentation`. The MDS ======= Run the ``ceph health`` command. Any operation that is hung in the MDS is indicated by the ``slow requests are blocked`` message. Messages that read ``failing to respond`` indicate that a client is failing to respond. The following list details potential causes of hung operations: #. The system is overloaded. The most likely cause of system overload is an active file set that is larger than the MDS cache. If you have extra RAM, increase the ``mds_cache_memory_limit``. The specific tunable ``mds_cache_memory_limit`` is discussed in the :ref:`MDS Cache Size`. Read the :ref:`MDS Cache Configuration` section in full before making any alterations to the ``mds_cache_memory_limit`` tunable. #. There is an older (misbehaving) client. #. There are underlying RADOS issues. See :ref:`The RADOS troubleshooting documentation`. Otherwise, you have probably discovered a new bug and should report it to the developers! .. _ceph_fuse_debugging: ceph-fuse debugging =================== ceph-fuse is an alternative to the CephFS kernel driver that mounts CephFS file systems in user space. ceph-fuse supports ``dump_ops_in_flight``. Use the following command to dump in-flight ceph-fuse operations for examination: .. .. prompt:: bash # the command goes here - 10 Aug 2025 See the :ref:`Mount CephFS using FUSE` documentation. Debug output ------------ To get more debugging information from ceph-fuse, list current operations in the foreground while logging to the console (``-d``), enabling client debug (``--debug-client=20``), and enabling prints for each message sent (``--debug-ms=1``). .. prompt:: bash # ceph daemon -d mds. dump_ops_in_flight --debug-client=20 --debug-ms=1 If you suspect a potential monitor issue, enable monitor debugging as well (``--debug-monc=20``) by running a command of the following form: .. prompt:: bash # ceph daemon -d mds. dump_ops_in_flight --debug-client=20 --debug-ms=1 --debug-monc=20 .. _kernel_mount_debugging: Kernel mount debugging ====================== The first step in diagnosing and repairing an issue with the kernel client is determining whether the problem is in the kernel client or in the MDS. If the kernel client itself is broken, evidence of its breakage will be in the kernel ring buffer, which can be examined by running the following command: .. prompt:: bash # dmesg Find the relevant kernel state. Slow requests ------------- Unfortunately, the kernel client does not provide an admin socket. However, the the kernel on the client has `debugfs `_ enabled, interfaces similar to the admin socket are available. Find a folder in ``/sys/kernel/debug/ceph/`` with a name like ``28f7427e-5558-4ffd-ae1a-51ec3042759a.client25386880``. That folder contains files that can be used to diagnose the causes of slow requests. Use ``cat`` to see their contents. These files are described below. The files most useful for diagnosis of slow requests are the ``mdsc`` (current requests to the MDS) and the ``osdc`` (current operations in-flight to OSDs) files. * ``bdi``: BDI info about the Ceph system (blocks dirtied, written, etc) * ``caps``: counts of file "caps" structures in-memory and used * ``client_options``: dumps the options provided to the CephFS mount * ``dentry_lru``: Dumps the CephFS dentries currently in-memory * ``mdsc``: Dumps current requests to the MDS * ``mdsmap``: Dumps the current MDSMap epoch and MDSes * ``mds_sessions``: Dumps the current sessions to MDSes * ``monc``: Dumps the current maps from the monitor, and any "subscriptions" held * ``monmap``: Dumps the current monitor map epoch and monitors * ``osdc``: Dumps the current ops in-flight to OSDs (ie, file data IO) * ``osdmap``: Dumps the current OSDMap epoch, pools, and OSDs If the data pool is in a ``NEARFULL`` condition, then the kernel CephFS client will switch to doing writes synchronously. Synchronous writes are quite slow. Disconnected+Remounted FS ========================= Because CephFS has a "consistent cache", the MDS will forcibly evict (and blocklist) clients from the cluster when the network connection has been disrupted for a long time. When this happens, the kernel client cannot safely write back dirty (buffered) data and this results in data loss. However: note that this behavior is appropriate and also follows POSIX semantics. The client has to be remounted to be able to access the file system again. This is the default behavior but it can be overridden by the ``recover_session`` mount option. See `the "options" section of the "mount.ceph" man page `_ You are in this situation if the output of ``dmesg`` contains something like the following:: [Fri Aug 15 02:38:10 2025] ceph: mds0 caps stale [Fri Aug 15 02:38:28 2025] libceph: mds0 (2)XXX.XX.XX.XX :6800 socket closed (con state OPEN) [Fri Aug 15 02:38:28 2025] libceph: mds0 (2)XXX.XX.XX.XX:6800 session reset [Fri Aug 15 02:38:28 2025] ceph: mds0 closed our session [Fri Aug 15 02:38:28 2025] ceph: mds0 reconnect start [Fri Aug 15 02:38:28 2025] ceph: mds0 reconnect denied Mounting ======== Mount 5 Error ------------- A ``mount 5`` error indicates a lagging MDS server or a crashed MDS server. Ensure that at least one MDS is up and running, and the cluster is ``active + healthy``. Mount 12 Error -------------- A mount 12 error with a message reading ``cannot allocate memory`` indicates a version mismatch between the :term:`Ceph Client` version and the :term:`Ceph Storage Cluster` version. Check the versions using the following command: .. prompt:: bash # ceph -v If the Ceph Client is of an older version than the Ceph cluster, upgrade the Client: .. prompt:: bash # sudo apt-get update && sudo apt-get install ceph-common If this fails to resolve the problem, uninstall, autoclean, and autoremove the ``ceph-common`` package and then reinstall it to ensure that you have the latest version of it. Dynamic Debugging ================= Dynamic debugging for CephFS kernel driver allows to enable or disable debug logging. The kernel driver logs are written to the kernel ring buffer and can be examined using ``dmesg(1)`` utility. Debug logging is disabled by default because enabling debug logging can result in system slowness and a drop in I/O throughput. Enable dynamic debug against the CephFS module. See: https://github.com/ceph/ceph/blob/master/src/script/kcon_all.sh Note: Running the above script enables debug logging for the CephFS kernel driver, libceph, and the kernel RBD module. To enable debug logging for a specific component (for example, for the CephFS kernel driver), run a command of the following form: .. prompt:: bash # echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control To disable debug logging, run a command of the following form: .. prompt:: bash # echo 'module ceph -p' > /sys/kernel/debug/dynamic_debug/control In-memory Log Dump ================== In-memory logs can be dumped by setting ``mds_extraordinary_events_dump_interval`` when the log level is set to less than ``10``. ``mds_extraordinary_events_dump_interval`` is the interval in seconds for dumping the recent in-memory logs when there is an extraordinary event. Extraordinary events include the following: * Client Eviction * Missed Beacon ACK from the monitors * Missed Internal Heartbeats In-memory log dump is disabled by default. This prevents production environments from experiencing log file bloat by default. Run the following two commands in order to enable in-memory log dumping: #. .. prompt:: bash # ceph config set mds debug_mds / Set ``log_level`` to a value of less than ``10``. Set ``gather_level`` to a value greater than ``10``. When those two values have been set, in-memory log dump is enabled. #. .. prompt:: bash # ceph config set mds mds_extraordinary_events_dump_interval When in-memory log dumping is enabled, the MDS checks for extraordinary events every ``mds_extraordinary_events_dump_interval`` seconds. If any extraordinary event occurs, the MDS dumps the in-memory logs that contain relevant event details to the Ceph MDS log. .. note:: When higher log levels are set (``log_level`` greater than or equal to ``10``) there is no reason to dump the in-memory logs. A lower gather level (``gather_level`` less than ``10``) is insufficient to gather in- memory logs. This means that a log level of greater than or equal to ``10`` or a gather level of less than ``10`` in ``debug_mds`` prevents enabling in-memory-log dumping. In such cases, if there is a failure, you must reset the value of ``mds_extraordinary_events_dump_interval`` to ``0`` before enabling the use of the above commands. Disable in-memory log dumping by running the following command: .. prompt:: bash # ceph config set mds mds_extraordinary_events_dump_interval 0 Filesystems Become Inaccessible After an Upgrade ================================================ .. note:: You can avoid ``operation not permitted`` errors by running this procedure before an upgrade. As of May 2023, it seems that ``operation not permitted`` errors of the kind discussed here occur after upgrades after Nautilus (inclusive). IF you have CephFS file systems that have data and metadata pools that were created by a ``ceph fs new`` command (meaning that they were not created with the defaults) OR you have an existing CephFS file system and are upgrading to a new post-Nautilus major version of Ceph THEN in order for the documented ``ceph fs authorize...`` commands to function as documented (and to avoid 'operation not permitted' errors when doing file I/O or similar security-related problems for all users except the ``client.admin`` user), you must first run: .. prompt:: bash $ ceph osd pool application set cephfs metadata and .. prompt:: bash $ ceph osd pool application set cephfs data Otherwise, when the OSDs receive a request to read or write data (not the directory info, but file data) they will not know which Ceph file system name to look up. This is true also of pool names, because the 'defaults' themselves changed in the major releases, from:: data pool=fsname metadata pool=fsname_metadata to:: data pool=fsname.data and metadata pool=fsname.meta Any setup that used ``client.admin`` for all mounts did not run into this problem, because the admin key gave blanket permissions. A temporary fix involves changing mount requests to the 'client.admin' user and its associated key. A less drastic but half-fix is to change the osd cap for your user to just ``caps osd = "allow rw"`` and delete ``tag cephfs data=....`` Disabling the Volumes Plugin ============================ In certain scenarios, the Volumes plugin may need to be disabled to prevent compromise for rest of the Ceph cluster. For details see: :ref:`disabling-volumes-plugin` Reporting Issues ================ If you have identified a specific issue, please report it with as much information as possible. Especially important information: * Ceph versions installed on client and server * Whether you are using the kernel or fuse client * If you are using the kernel client, what kernel version? * How many clients are in play, doing what kind of workload? * If a system is 'stuck', is that affecting all clients or just one? * Any ceph health messages * Any backtraces in the ceph logs from crashes If you are satisfied that you have found a bug, please file it on `the bug tracker`. For more general queries, please write to the `ceph-users mailing list`. .. _the bug tracker: http://tracker.ceph.com .. _ceph-users mailing list: http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com/