"Object was retained too many times"

I have a long-running server written in Swift, using SwiftNIO. After 3.5 days of successful operation, one of the longest uptimes yet, I got this fatal error and stack trace:

Fatal error: Object was retained too many timesCurrent stack trace:
0    libswiftCore.so                    0x00007f9043a73140 swift_reportError + 50
1    libswiftCore.so                    0x00007f9043a7321d <unavailable> + 4608541
2    libswiftCore.so                    0x00007f9043a73354 <unavailable> + 4608852
3    libswiftCore.so                    0x00007f9043a976b9 <unavailable> + 4757177
4    libswiftCore.so                    0x00007f9043a73fe0 swift_retain + 86
Aborted (core dumped)

Unfortunately from the stack trace I can't tell what was being retained. This appears to be a very obscure error. No references to it on Swift Forums. A google search for "Object was retained too many times" gives exactly 5 results, only 3 of them in English.

It appears to be this call in apple/swift.

The intuition is to run my server in Xcode's memory analyzer. This is extremely difficult for a handful of reasons:

  1. Will I have to run it for 3.5 days to encounter the same error?
  2. My MacBook doesn't have the CPU capability to run the server. I might potentially dramatically reduce the number of connections the server is set to handle. Would this prevent the error from occurring?
  3. Currently the code doesn't compile on macOS due to differences in the standard library across platforms. These can be patched, with effort.
  4. The server's requests require access keys and IP whitelisting that my MacBook does not have.
    To date all development has been on Linux command line.

I think if I'm going to resolve this issue it's most likely going to be by methodically changing every portion of the code that could potentially cause such a retain overflow.

  • How many retains is "too many"? 64,000? 2 billion? I searched around apple/swift for swift_abortRetainOverflow but wasn't able to discover the answer to this question.
  • What sort of code smells might cause such a retain overflow? Adding a particular Swift object or struct to too many Arrays? My code has a lot of recursive calls. Some networks requests need to occur every 15 seconds or every minute, so they schedule themselves in a loop. Due to tail call optimization this isn't overflowing the stack, but could it be overflowing retains?

Any experience with this or suggestions on what to look for would be greatly appreciated.

3 Likes

Thanks for the report! The limit is 2^31, which ought to be highly unlikely for any object to reach in normal circumstances. It's likely that we have a compiler bug causing an unbalanced retain that accretes over time until the limit is hit.

To track down what object is triggering the overflow, you might be able to hook into swift_retain so that it raises an alert when an object exceeds a lower threshold, like 1000 or something. There are global variables containing function pointers to the retain/release implementation that you can reassign to hook the operations, so maybe you can do something like this in a C file you mix into your Swift project:

extern void *(*_swift_retain)(void *);

static void *(*_old_swift_retain)(void*);

size_t swift_retainCount(void *);

static void *swift_retain_hook(void *object) {
  if (swift_retainCount(object) > 1000)
    fprintf(stderr, "object at %p has 1000 retains!", object);
  return _old_swift_retain(object);
}

__attribute__((constructor))
static void hook_swift_retain() {
  _old_swift_retain = _swift_retain;
  _swift_retain = swift_retain_hook;
}
11 Likes

Thank you very much @Joe_Groff. I've done the following:

In Package.swift:

        .target(name: "RetainTroubleshooting"),
        .target(name: "MyServerLib", dependencies: ["RetainTroubleshooting", ...]

Inside Sources/:
Screen Shot 2020-10-02 at 16.17.20

retain.c:

#include <stdlib.h>
#include <stdio.h>

extern void *(*_swift_retain)(void *);

static void *(*_old_swift_retain)(void*);

size_t swift_retainCount(void *);

static void *swift_retain_hook(void *object) {
  if (swift_retainCount(object) > 1000) {
      fprintf(stderr, "object at %p has %zu retains!\n", object, swift_retainCount(object));
  }
  return _old_swift_retain(object);
}

__attribute__((constructor))
static void hook_swift_retain() {
  _old_swift_retain = _swift_retain;
  _swift_retain = swift_retain_hook;
}

modulemap.module:

module RetainTroubleshooting {
	header "include/retain.h"
	export *
}

retain.h:

#include <stdlib.h>
#include <stdio.h>

static void hook_swift_retain();

size_t swift_retainCount(void *);

This works. Seconds after starting my server I see retains in the low thousands, and after a handful of minutes I see half a million retains on certain objects:

object at 0x7f243c002670 has 514384 retains!                                                                          [0/1924]
object at 0x7f243c002670 has 514385 retains!
object at 0x7f243c002670 has 514385 retains!
object at 0x7f243c002670 has 514386 retains!
object at 0x7f243c002670 has 514386 retains!
object at 0x7f243c002670 has 514387 retains!
object at 0x7f243c002670 has 514386 retains!
object at 0x7f243c002670 has 514387 retains!
object at 0x7f243c002670 has 514388 retains!
object at 0x7f243c002670 has 514389 retains!
object at 0x7f243c002670 has 514387 retains!
object at 0x7f243c002670 has 514388 retains!
object at 0x7f243c002670 has 514388 retains!
object at 0x7f243c002670 has 514389 retains!
object at 0x7f243c002670 has 514389 retains!
object at 0x7f243c002670 has 514390 retains!
object at 0x7f243c002670 has 514390 retains!
object at 0x7f243c002670 has 514391 retains!
object at 0x7f243c002670 has 514391 retains!
object at 0x7f243c002670 has 514392 retains!
object at 0x7f243c002670 has 514392 retains!
object at 0x7f243c002670 has 514393 retains!
object at 0x7f243c002670 has 514393 retains!
object at 0x7f243c002670 has 514394 retains!
object at 0x7f243c002670 has 514394 retains!
object at 0x7f243c002670 has 514395 retains!
object at 0x7f243c002670 has 514395 retains!
object at 0x7f243c002670 has 514396 retains!

It's not always the same object:

object at 0x7f25bcb53458 has 1963 retains!
object at 0x7f25bcb53458 has 1961 retains!
object at 0x7f25bcb53458 has 1961 retains!
object at 0x7f25bcb53458 has 1961 retains!
object at 0x7f243c002670 has 662592 retains!
object at 0x7f243c002670 has 662593 retains!
object at 0x7f243c002670 has 662593 retains!
object at 0x7f243c002670 has 662594 retains!
object at 0x7f243c002670 has 662594 retains!
object at 0x7f243c002670 has 662595 retains!
object at 0x7f25bcb53458 has 1961 retains!
object at 0x7f25bcb53458 has 1962 retains!
object at 0x7f25bcb53458 has 1963 retains!
object at 0x7f25bcb53458 has 1964 retains!
object at 0x7f25bcb53458 has 1964 retains!

Now to understand what these objects are. Now that I have these pointers, how do I inspect them to determine what objects they're pointing to?

It's worth pointing out that there is no obvious memory leak over the 3.5 day lifetime of my server run:


Minutes after starting, the memory usage of 32GB was at 9.79% and 3.5 days later it was at 11.35%.

If you're able to attach to the running process with lldb, hopefully it can help by letting you evaluate Swift expressions on the object, like:

(lldb) po unsafeBitCast(0x7f243c002670, to: AnyObject.self)

Otherwise, you could try logging more information from the retain hook. The Swift runtime has a function swift_getTypeName that can give you a human-readable string from an isa pointer:

const char *swift_getTypeName(void *classObject, _Bool qualified);

static void *swift_retain_hook(void *object) {
  if (swift_retainCount(object) > 1000) {
    void *isa = *(void**)object;
    const char *className = swift_getTypeName(isa, 1);
    fprintf(stderr, "%s at %p has 1000 retains!", className, object);
  }
  return _old_swift_retain(object);
}
2 Likes

Once again, extremely helpful @Joe_Groff, thank you! Some example output:

Fri Oct  2 14:17:04 2020: Foundation.NSLocale at 0x7f2828002670 has 11487 retains!
Fri Oct  2 14:17:04 2020: Foundation.NSLocale at 0x7f2828002670 has 11488 retains!
Fri Oct  2 14:17:04 2020: Foundation.NSLocale at 0x7f2828002670 has 11487 retains!
Fri Oct  2 14:17:04 2020: Foundation.NSLocale at 0x7f2828002670 has 11488 retains!
Fri Oct  2 14:17:04 2020: Foundation.NSLocale at 0x7f2828002670 has 11488 retains!
Fri Oct  2 14:17:04 2020: Foundation.NSLocale at 0x7f2828002670 has 11489 retains!
Fri Oct  2 14:17:04 2020: Foundation.NSLocale at 0x7f2828002670 has 11489 retains!
Fri Oct  2 14:17:04 2020: Foundation.NSLocale at 0x7f2828002670 has 11490 retains!
Fri Oct  2 14:17:04 2020: Foundation.NSLocale at 0x7f2828002670 has 11490 retains!
Fri Oct  2 14:17:04 2020: Foundation.NSLocale at 0x7f2828002670 has 11491 retains!
Fri Oct  2 14:17:04 2020: Foundation.NSLocale at 0x7f2828002670 has 11491 retains!
Fri Oct  2 14:17:04 2020: Foundation.NSLocale at 0x7f2828002670 has 11492 retains!
Fri Oct  2 14:17:04 2020: Swift._ContiguousArrayStorage<Logging.LogHandler> at 0x55bee922ca50 has 2944 retains!
Fri Oct  2 14:17:04 2020: Swift._ContiguousArrayStorage<Logging.LogHandler> at 0x55bee922ca50 has 2945 retains!
Fri Oct  2 14:17:04 2020: Swift._ContiguousArrayStorage<Logging.LogHandler> at 0x55bee922ca50 has 2946 retains!
Fri Oct  2 14:17:04 2020: Swift._ContiguousArrayStorage<Logging.LogHandler> at 0x55bee922ca50 has 2947 retains!
Fri Oct  2 14:17:04 2020: Swift._ContiguousArrayStorage<Logging.LogHandler> at 0x55bee922ca50 has 2945 retains!
Fri Oct  2 14:17:04 2020: Swift._ContiguousArrayStorage<Logging.LogHandler> at 0x55bee922ca50 has 2946 retains!
Fri Oct  2 14:17:04 2020: Swift._ContiguousArrayStorage<Logging.LogHandler> at 0x55bee922ca50 has 2947 retains!
Fri Oct  2 14:17:04 2020: Swift._ContiguousArrayStorage<Logging.LogHandler> at 0x55bee922ca50 has 2947 retains!
Fri Oct  2 14:17:04 2020: Swift._ContiguousArrayStorage<Logging.LogHandler> at 0x55bee922ca50 has 2947 retains!
Fri Oct  2 14:17:04 2020: MyLib.FileLogHandler at 0x55bee8d8c320 has 1266 retains!
Fri Oct  2 14:17:04 2020: MyLib.FileLogHandler at 0x55bee8d8c320 has 1267 retains!
Fri Oct  2 14:17:04 2020: MyLib.FileLogHandler at 0x55bee8d8c320 has 1267 retains!
Fri Oct  2 14:17:04 2020: Swift._ContiguousArrayStorage<Logging.LogHandler> at 0x55bee922ca50 has 2947 retains!
Fri Oct  2 14:17:04 2020: <<< invalid type >>> at 0x55bee8d99050 has 1266 retains!
Fri Oct  2 14:17:04 2020: <<< invalid type >>> at 0x55bee8d99050 has 1267 retains!

I am seeing my Logging.LogHandler goes up to ~2900 retains and then remains stable around that number for the duration of my 10 minute test. This isn't too surprising. I have resisted creating a global logger and instead I pass it to all functions and objects as an argument. This is to make it easier to segment different parts of the app onto different loggers so I can break it up into different libraries in the future. Perhaps I should give up and create global loggers. There are ~400 connections and several long-lived objects per connection and the Logger is passed to each object, so this number checks out. However, it stays around 2900 for the duration of my test so it's not the cause of our problems.

I see only one object that ever goes above 10,000 retains, and indeed it explodes way beyond that:

Fri Oct  2 14:23:49 2020: Foundation.NSLocale at 0x7f2828002670 has 1012416 retains!

10 minutes after start it has over 1M retains. This is interesting. I have no references to NSLocale in my code. The only occurrences of locale are the following:

extension Date {
    func toRFC5254String() -> String {
        let dateFormatter = DateFormatter()
        dateFormatter.locale = Locale(identifier: "en_US_POSIX")
        dateFormatter.dateFormat = "yyyy-MM-dd'T'HH:mm:ss.SSSZZ"
        return dateFormatter.string(from: self)
    }

    static func millisecondsStringToDate(_ string: String) -> Date? {
        let dateFormatter = DateFormatter()
        dateFormatter.locale = Locale(identifier: "en_US_POSIX")
        dateFormatter.dateFormat = "yyyy-MM-dd'T'HH:mm:ss.SSSSSSZZZZZZ"
        if let date = dateFormatter.date(from: string) {
            return date
        } else {
            dateFormatter.dateFormat = "yyyy-MM-dd'T'HH:mm:ssZZZZZZ"
            return dateFormatter.date(from: string)
        }
    }
}

I should look through my project's dependencies to see if they're doing anything with NSLocale.

This is on Ubuntu 18.04 with Swift for Tensorflow 0.10:

$ which swift
/home/xander/swift-tensorflow-RELEASE-0.10-cuda10.2-cudnn7-ubuntu18.04/usr/bin/swift
$ swift --version
Swift version 5.3-dev (LLVM 55d27a5828, Swift 6a5d84ec08)
Target: x86_64-unknown-linux-gnu

I'm unsure if the above reproduces on Swift dev trunk, naturally my project doesn't build without TensorFlow. There is a newer 0.11 version of Swift for Tensorflow but I'm unable to use it because of a compiler bug in Swift dev trunk, still present as of 28 Sept 2020 snapshot (reported here).

Please open a bug at bugs.swift.org. I have an idea of what's going on.

2 Likes

That dateFormatter code you show makes a new formatter and locale every time these methods are called. You could easily rewrite the code to use a global formatter that's only created once. That might or might not be better than this code but it would show you if the problem is in these functions.

Bug reported in SR-13649, thanks for taking a look.

Unfortunately this isn't so easy. This is a highly multi-threaded application and using a a single global DateFormatter caused segfaults within seconds of starting execution. The DateFormatter is being called on each Logger message as well as each creation of a Codable object for insertion into BigQuery via JSON. So, I tried replacing it in the logger:

Originally my Logger message string looked like this:

          let string = """
         \(Date().toRFC5254String()) \(id) \(level) : \
          \(URL(fileURLWithPath: file).lastPathComponent):\(function):\(line) : \(message)\n
          """

I changed this to the following:

    private func timestamp() -> String {
         var buffer = [Int8](repeating: 0, count: 255)
         var timestamp = time(nil)
         let localTime = localtime(&timestamp)
         strftime(&buffer, buffer.count, "%Y-%m-%dT%H:%M:%S%z", localTime)
         return buffer.withUnsafeBufferPointer {
             $0.withMemoryRebound(to: CChar.self) {
                 String(cString: $0.baseAddress!)
             }
         }
     }

        let string = """
        \(self.timestamp()) \(modelId) \(level) : \
        \(URL(fileURLWithPath: file).lastPathComponent):\(function):\(line) : \(message)\n
        """

The above uses pure C to get the timestamps as a String, so no Locale is created or used at all. Date is never called. Unfortunately this had no effect on the NSLocale retains accreting. The use of it in my Logger is not the cause, then.

I then removed the references in my Codables, replacing their Dates with bogus values while turning off production database insertion. I was able to completely comment out the previously posted Date extensions and run, but the NSLocale retains still explode.

There are now absolutely no references to locale in my project and the issue persists. :frowning_face: Tomorrow I will dig through my project's dependencies to see if they're doing anything with locales.

I suspect this is not within your code at all. There are likely to be retains of NSLocale within Foundation itself.

One way you can test this is to crash the program when the retain count gets too high, and then run the program under lldb. Lldb will automatically take over when the program crashes and give you a traceback. This should point at the code that is emitting the retains.

1 Like

Good idea @lukasa!

I added this to the swift_retain_hook in retain.c:

      if (swift_retainCount(object) > 1000000) {
          abort();
      }

Running in lldb, I see a stream of errors when it hits the abort and halts:

error: could not build C module 'CNIOBoringSSLShims'
)
warning: (x86_64) /home/xander/dev/model/.build/x86_64-unknown-linux-gnu/debug/ModelApp unable to load swift module "NIOTLS" (failed to get module "AsyncHTTPClient" from AST context:
<module-includes>:1:10: note: in file included from <module-includes>:1:
#include "/home/xander/dev/model/.build/checkouts/swift-nio-ssl/Sources/CNIOBoringSSLShims/include/CNIOBoringSSLShims.h"
         ^

error: /home/xander/dev/model/.build/checkouts/swift-nio-ssl/Sources/CNIOBoringSSLShims/include/CNIOBoringSSLShims.h:22:10: error: 'CNIOBoringSSL.h' file not found
#include "CNIOBoringSSL.h"
         ^
 
error: could not build C module 'CNIOBoringSSLShims'
)
warning: (x86_64) /home/xander/dev/model/.build/x86_64-unknown-linux-gnu/debug/ModelApp unable to load swift module "NIOWebSocket" (failed to get module "AsyncHTTPClient" from AST context:
<module-includes>:1:10: note: in file included from <module-includes>:1:
#include "/home/xander/dev/model/.build/checkouts/swift-nio-ssl/Sources/CNIOBoringSSLShims/include/CNIOBoringSSLShims.h"
         ^

error: /home/xander/dev/model/.build/checkouts/swift-nio-ssl/Sources/CNIOBoringSSLShims/include/CNIOBoringSSLShims.h:22:10: error: 'CNIOBoringSSL.h' file not found
#include "CNIOBoringSSL.h"
         ^

error: could not build C module 'CNIOBoringSSLShims'
)
warning: (x86_64) /home/xander/dev/model/.build/x86_64-unknown-linux-gnu/debug/ModelApp unable to load swift module "OAuth2" (failed to get module "AsyncHTTPClient" from AST context:
<module-includes>:1:10: note: in file included from <module-includes>:1:
#include "/home/xander/dev/model/.build/checkouts/swift-nio-ssl/Sources/CNIOBoringSSLShims/include/CNIOBoringSSLShims.h"
         ^

error: /home/xander/dev/model/.build/checkouts/swift-nio-ssl/Sources/CNIOBoringSSLShims/include/CNIOBoringSSLShims.h:22:10: error: 'CNIOBoringSSL.h' file not found
#include "CNIOBoringSSL.h"
         ^

error: could not build C module 'CNIOBoringSSLShims'
)
warning: (x86_64) /home/xander/dev/model/.build/x86_64-unknown-linux-gnu/debug/ModelApp unable to load swift module "OrderedDictionary" (failed to get module "AsyncHTTPClient" from AST context:
<module-includes>:1:10: note: in file included from <module-includes>:1:
#include "/home/xander/dev/model/.build/checkouts/swift-nio-ssl/Sources/CNIOBoringSSLShims/include/CNIOBoringSSLShims.h"
         ^

error: /home/xander/dev/model/.build/checkouts/swift-nio-ssl/Sources/CNIOBoringSSLShims/include/CNIOBoringSSLShims.h:22:10: error: 'CNIOBoringSSL.h' file not found
#include "CNIOBoringSSL.h"
         ^

error: could not build C module 'CNIOBoringSSLShims'
)
warning: (x86_64) /home/xander/dev/model/.build/x86_64-unknown-linux-gnu/debug/ModelApp unable to load swift module "PenguinCSV" (failed to get module "AsyncHTTPClient" from AST context:
<module-includes>:1:10: note: in file included from <module-includes>:1:
#include "/home/xander/dev/model/.build/checkouts/swift-nio-ssl/Sources/CNIOBoringSSLShims/include/CNIOBoringSSLShims.h"
         ^

error: /home/xander/dev/model/.build/checkouts/swift-nio-ssl/Sources/CNIOBoringSSLShims/include/CNIOBoringSSLShims.h:22:10: error: 'CNIOBoringSSL.h' file not found
#include "CNIOBoringSSL.h"
         ^

And then the process info:

Process 30858 stopped
* thread #163, name = 'NIO-ELT-0-#6', stop reason = signal SIGABRT
    frame #0: 0x00007fffad62ff47 libc.so.6`gsignal + 199
libc.so.6`gsignal:
->  0x7fffad62ff47 <+199>: movq   0x108(%rsp), %rcx
    0x7fffad62ff4f <+207>: xorq   %fs:0x28, %rcx
    0x7fffad62ff58 <+216>: movl   %r8d, %eax
    0x7fffad62ff5b <+219>: jne    0x7fffad62ff7c            ; <+252>
  thread #171, name = 'NIO-ELT-0-#14', stop reason = signal SIGABRT
    frame #0: 0x00007fffad62ff47 libc.so.6`gsignal + 199
libc.so.6`gsignal:
->  0x7fffad62ff47 <+199>: movq   0x108(%rsp), %rcx
    0x7fffad62ff4f <+207>: xorq   %fs:0x28, %rcx
    0x7fffad62ff58 <+216>: movl   %r8d, %eax
    0x7fffad62ff5b <+219>: jne    0x7fffad62ff7c            ; <+252>
Target 0: (ModelApp) stopped.

frame info makes sense:

frame #0: 0x00007fffad62ff47 libc.so.6`gsignal + 199

And then, when I attempt thread backtrace:

lldb: /swift-base/llvm-project/lldb/source/Symbol/TypeSystemSwiftTypeRef.cpp:904: virtual bool lldb_private::TypeSystemSwiftTypeRef::IsFunctionType(lldb::opaque_compiler_type_t, bool *): Assertion `(Equivalent(result, (m_swift_ast_context->IsFunctionType( ReconstructType(type), nullptr))) || Whitelisted(*m_swift_ast_context)) && "TypeSystemSwiftTypeRef diverges from SwiftASTContext"' failed.
Aborted (core dumped)

And lldb has exited. :frowning:

$ find / -name CNIOBoringSSL.h 2> /dev/null
/home/xander/dev/model/.build/checkouts/swift-nio-ssl/Sources/CNIOBoringSSL/include/CNIOBoringSSL.h

This seems to be right?

Symbolification on Linux has always been difficult for me. This build is swift build -Xlinker --export-dynamic -Xswiftc -g. I don't think I see anything related to our issue here, need to figure out how to make lldb work.

Following this discussion, I got lldb to produce a backtrace using this as the build command:

swift build -Xlinker --export-dynamic -Xswiftc -g -Xswiftc -Xfrontend -Xswiftc -serialize-debugging-options

Our backtrace:

(lldb) thread backtrace
* thread #170, name = 'NIO-ELT-0-#13', stop reason = signal SIGABRT
  * frame #0: 0x00007fffad62ff47 libc.so.6`gsignal + 199
    frame #1: 0x00007fffad6318b1 libc.so.6`abort + 321
    frame #2: 0x00005555560b094f ModelApp`swift_retain_hook(object=0x00007ffe80002670) at retain.c:24:11
    frame #3: 0x00007ffff6667ece libFoundation.so`CFRetain + 14
    frame #4: 0x00007ffff669a77d libFoundation.so`__CreateCFDateFormatter + 157
    frame #5: 0x00007ffff636149f libFoundation.so`Foundation.NSDate.description.getter : Swift.String + 159
    frame #6: 0x00007ffff6102ba6 libFoundation.so`Foundation.Date.description.getter : Swift.String + 70
    frame #7: 0x00007ffff61031df libFoundation.so`protocol witness for Swift.CustomStringConvertible.description.getter : Swift.String in conformance Foundation.Date : Swift.CustomStringConvertible in Foundation + 15
    frame #8: 0x00007ffff7960cd7 libswiftCore.so`Swift.DefaultStringInterpolation.appendInterpolation<Ď„_0_0 where Ď„_0_0: Swift.CustomStringConvertible>(Ď„_0_0) -> () + 23
    frame #9: 0x0000555555feb7d7 ModelApp`closure #1 in MyData.Callback(dataUpdates=3 values, self=0x0000555557a71be0) at MyData.swift:163:63
    frame #10: 0x0000555555ffffc6 ModelApp`closure #1 in DataStream.handleResponseString(self=0x00007fff0825b850, string="{\"channel\": \"data\", \"type\": \"update\", \"data\": {\"time\": 1601744938.7753208, \"action\": \"update\"}}") at DataStream.swift:152:30
    frame #11: 0x0000555555805e6f ModelApp`thunk for @escaping @callee_guaranteed () -> (@error @owned Error) at <compiler-generated>:0
    frame #12: 0x0000555556000464 ModelApp`thunk for @escaping @callee_guaranteed () -> (@error @owned Error)partial apply at <compiler-generated>:0
    frame #13: 0x0000555555c384d6 ModelApp`closure #1 in EventLoop.submit<Self>(promise=NIO.EventLoopPromise<()> @ 0x00007ffef17f35a8, task=0x0000555556000450 ModelApp`reabstraction thunk helper from @escaping @callee_guaranteed () -> (@error @owned Swift.Error) to @escaping @callee_guaranteed () -> (@out (), @error @owned Swift.Error)partial apply forwarder with unmangled suffix ".30" at <compiler-generated>) at EventLoop.swift:488:37
    frame #14: 0x0000555555cc6f1c ModelApp`thunk for @escaping @callee_guaranteed () -> () at <compiler-generated>:0
    frame #15: 0x0000555555ccb691 ModelApp`partial apply for thunk for @escaping @callee_guaranteed () -> () at <compiler-generated>:0
    frame #16: 0x0000555555cc6f3c ModelApp`thunk for @escaping @callee_guaranteed () -> (@out ()) at <compiler-generated>:0
    frame #17: 0x0000555555cc6f87 ModelApp`closure #4 in SelectableEventLoop.run(task=0x0000555555cca070 ModelApp`partial apply forwarder for reabstraction thunk helper from @escaping @callee_guaranteed () -> (@out ()) to @escaping @callee_guaranteed () -> () at <compiler-generated>) at SelectableEventLoop.swift:454:25
    frame #18: 0x0000555555b6d00f ModelApp`thunk for @callee_guaranteed () -> (@error @owned Error) at <compiler-generated>:0
    frame #19: 0x0000555555cca034 ModelApp`thunk for @callee_guaranteed () -> (@error @owned Error)partial apply at <compiler-generated>:0
    frame #20: 0x0000555555cc0a62 ModelApp`withAutoReleasePool<T>(execute=0x0000555555cca020 ModelApp`reabstraction thunk helper from @callee_guaranteed () -> (@error @owned Swift.Error) to @escaping @callee_guaranteed () -> (@out (), @error @owned Swift.Error)partial apply forwarder with unmangled suffix ".9" at <compiler-generated>) at SelectableEventLoop.swift:26:16
    frame #21: 0x0000555555cc52cc ModelApp`SelectableEventLoop.run(self=0x00007ffed4001000) at SelectableEventLoop.swift:453:21
    frame #22: 0x0000555555c3b82b ModelApp`static MultiThreadedEventLoopGroup.runTheLoop(thread=0x00007ffed4000b20, canEventLoopBeShutdownIndividually=false, selectorFactory=0x0000555555c3c370 ModelApp`implicit closure #1 () throws -> NIO.Selector<NIO.NIORegistration> in NIO.MultiThreadedEventLoopGroup.init(numberOfThreads: Swift.Int) -> NIO.MultiThreadedEventLoopGroup at EventLoop.swift, initializer=0x0000555555c42240 ModelApp`partial apply forwarder for reabstraction thunk helper from @escaping @callee_guaranteed (@in_guaranteed NIO.NIOThread) -> (@out ()) to @escaping @callee_guaranteed (@guaranteed NIO.NIOThread) -> () at <compiler-generated>, callback=0x0000555555c423c0 ModelApp`partial apply forwarder for closure #1 (NIO.SelectableEventLoop) -> () in closure #1 (NIO.NIOThread) -> () in static NIO.MultiThreadedEventLoopGroup.setupThreadAndEventLoop(name: Swift.String, selectorFactory: () throws -> NIO.Selector<NIO.NIORegistration>, initializer: (NIO.NIOThread) -> ()) -> NIO.SelectableEventLoop at <compiler-generated>, self=NIO.MultiThreadedEventLoopGroup) at EventLoop.swift:824:22
    frame #23: 0x0000555555c3bfdf ModelApp`closure #1 in static MultiThreadedEventLoopGroup.setupThreadAndEventLoop(t=0x00007ffed4000b20, selectorFactory=0x0000555555c3c370 ModelApp`implicit closure #1 () throws -> NIO.Selector<NIO.NIORegistration> in NIO.MultiThreadedEventLoopGroup.init(numberOfThreads: Swift.Int) -> NIO.MultiThreadedEventLoopGroup at EventLoop.swift, initializer=0x0000555555c42240 ModelApp`partial apply forwarder for reabstraction thunk helper from @escaping @callee_guaranteed (@in_guaranteed NIO.NIOThread) -> (@out ()) to @escaping @callee_guaranteed (@guaranteed NIO.NIOThread) -> () at <compiler-generated>, lock=(mutex = 0x0000555557a68d50), _loop=0x00007ffed4001000, loopUpAndRunningGroup=0x0000555557a68d80) at EventLoop.swift:844:41
    frame #24: 0x0000555555c42333 ModelApp`partial apply for closure #1 in static MultiThreadedEventLoopGroup.setupThreadAndEventLoop(name:selectorFactory:initializer:) at <compiler-generated>:0
    frame #25: 0x0000555555c3c5ff ModelApp`thunk for @escaping @callee_guaranteed (@guaranteed NIOThread) -> () at <compiler-generated>:0
    frame #26: 0x0000555555d1b5f1 ModelApp`partial apply for thunk for @escaping @callee_guaranteed (@guaranteed NIOThread) -> () at <compiler-generated>:0
    frame #27: 0x0000555555d1eacc ModelApp`closure #1 in static ThreadOpsPosix.run($0=(_rawValue = 0x0000555557a68e90)) at ThreadPosix.swift:105:13
    frame #28: 0x0000555555d1ec69 ModelApp`@objc closure #1 in static ThreadOpsPosix.run(handle:args:detachThread:) at <compiler-generated>:0
    frame #29: 0x00007ffff75ba6db libpthread.so.0`start_thread + 219
    frame #30: 0x00007fffad712a3f libc.so.6`clone + 63

So, it looks like Foundation is creating the DateFormatter each time my project receives some data over the network and decodes it into an object. The MyData.swift:163 it's referencing is this code:

    func updatesCallback(type: MyType) -> (([Update]) -> Void) {
        return { updates in
            let LinkedList = self.updates[type]!
            // The below if statement and for loop entire code block are a debug check to see how often the lack of
            //  FIFO guarantee on async dispatch queues causes old data to be inserted after new data
            if LinkedList.count > 0 && updates.count > 0 {
                for update in updates {
                    let topUpdate = linkedList.suffix(1)![0]
                    let errorText = """
                    Expected FIFO execution of async dispatch queues, but found historical data being inserted onto \
                    linked list. Top of linked list has time \(update.time) \
                    (\(update.timeDouble)), \
                    while inserted update with time \(topUpdate.time) (\(topUpdate.timeDouble))
                    """
                    if topUpdate.time > update.time {
                        self.logger.error(Logger.Message(stringLiteral: errorText))
                        assertionFailure(errorText)
                    }
                }
            }
            linkedList.append(contentsOf: updates)
        }
    }

linked list. Top of linked list has time \(update.time) \ is line 163.

So Foundation is creating a DateFormatter to put the object's Date into the String. I wrote this as a debug clause to test if the lack of FIFO guarantee in dispatch asyncs actually leads to out-of-order data. (It never has). Notice that the errorText is being created even when the the error condition is not met. When I move it inside the if topUpdate.time > update.time, the retain issue completely goes away! :partying_face: I ran for 20+ minutes and never saw any object reach a retain >10000.

So, the compiler bug case appears to be this:

  • Each time data is received over the network, JSON decode it into Update objects that contain Date properties.
  • Call a partially evaluated callback function (like func updatesCallback(type: MyType) -> (([Update]) -> Void)) with every Update.
  • In this function, put the Update's Date property into a string: let string = "\(update.date)". Then do nothing with the variable string.

This is the case causing the retain accretion. Thanks very much @Joe_Groff and @lukasa for teaching me the tools necessary to track this down. Simply by creating an object only when it's actually used I've solved this issue for myself for the time being, but it is a complier bug. I will attempt to provide enough information in the bug report to help the team fix it.

11 Likes

Great diagnosis, well done!

1 Like

Interesting. So can you write a simple test app that just calls a method similar to your updatesCallback method many times that demonstrates this issue?

Technically, it looks like the leak is somewhere inside NSDate.description in the non-Darwin Foundation implementation: either a CFDateFormatter is being leaked or it’s over-retaining the locale object. That may or may not be a compiler bug, but it’s definitely not your bug.

1 Like

Thanks @John_McCall @lukasa and @xanderdunn for this thread, some really valuable stuff in here! I imagine these swift runtime hooks are not meant to be public, but are they documented somewhere for internal use? This kind of stuff is really useful when debugging on other platforms

This week I will endeavor to make a copy of my codebase and strip code and mock the network until I've got a minimal reproducing example. This should build without TensorFlow. Will post if I succeed.

These aren’t so much runtime hooks as they are just...the runtime. In most environments it is possible to hook functions provided by a library by simply defining them yourself in the right place. On Linux this is commonly done with LD_PRELOAD, for example. SwiftNIO does this as part of our allocation counting test framework, where we hook the entry points to the allocator.

The functions you can hook this way are the functions that are exposed by a library as linkable symbols. In this case the library you care about is the runtime itself, and the symbols you want are the runtime symbols. Many of the functions are what I might call “lightly documented” here. I’d be reluctant to assume that this document is entirely up-to-date, but it’s a good starting point. You can also get these symbols by using nm on the Swift runtime library itself.

Once you have a symbol that has a likely name, you can search the Swift codebase to find its function signature in order to forward-declare and call it.

4 Likes

Ok we've got a minimally reproducing code sample here!!

To reproduce:

$ git clone https://github.com/xanderdunn/swift-retain-repro.git
$ cd swift-retain-repro
$ swift run

The line that causes it is defining let errorText at MyData.swift:dataCallback(dataFeeder:):39

Note that it is necessary that my custom data structure BoundedLIFOLinkedList is used to reproduce this bug. When I replaced it with a simple Array, the retain accretion no longer occurred. I'm not using any pointers or unsafe methods in BoundedLIFOLinkedList, so I'm not sure how it is causing it.

Unlike the initial report, this code does not rely on TensorFlow and I was able to reproduce this bug on Ubuntu 18.04 with Swift dev trunk 2020-10-07:

$ which swift
/home/xander/swift-DEVELOPMENT-SNAPSHOT-2020-10-07-a-ubuntu18.04/usr/bin/swift
$ swift --version
Swift version 5.3-dev (LLVM 240312aa7333e90, Swift 15bf0478ad7c47c)
Target: x86_64-unknown-linux-gnu

I've added all of these details to SR-13649.

2 Likes

It took me a few reads of your message and a re-read of the actual thread to understand what you mean by this. I was getting confused because there was much talk of swift_retain_hook but I realise now that there was some function pointer mangling done in order to actually define that hook, and that the name including hook is actually arbitrary. Makes sense now. Either way, very useful, thanks!