p5 · 11mo ago

Hope someone can help us out here. We're running Langfuse in an ECS container, and are able to log in and create projects fine. However, the ECS container is in a constant loop of being spun up, failing and being destroyed, and we're unsure why. We keep receiving "Error: connect ECONNREFUSED 169.254.172.2:3000". We're unsure what that IP address is - it might be the internal IP of our container, but it's certainly not anything we are knowingly hosting (e.g. load balancer, Postgres etc.). This error shows after the container has been running for a couple of minutes. Do you recommend anything we should check to try and debug this?
This is using the Docker image tagged with v1.1.0 (Full error in thread)
10 Replies
p5 · 11mo ago
Error:
2023-11-19T14:50:38.432+00:00 Prisma schema loaded from prisma/schema.prisma
2023-11-19T14:50:38.539+00:00 Datasource "db": PostgreSQL database "langfuse", schema "public" at "langfuse.testing.<REDACTED>:5432"
2023-11-19T14:50:39.041+00:00 65 migrations found in prisma/migrations
2023-11-19T14:50:39.244+00:00 No pending migrations to apply.
2023-11-19T14:50:43.548+00:00 ▲ Next.js 13.5.6
2023-11-19T14:50:43.549+00:00 - Local: http://ip-10-64-101-126.eu-west-1.compute.internal:3000
2023-11-19T14:50:43.549+00:00 - Network: http://10.64.101.126:3000
2023-11-19T14:50:43.549+00:00 ✓ Ready in 2s
2023-11-19T14:55:00.233+00:00 TypeError: fetch failed
2023-11-19T14:55:00.233+00:00 at Object.fetch (node:internal/deps/undici/undici:11372:11)
2023-11-19T14:55:00.233+00:00 at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
2023-11-19T14:55:00.233+00:00 cause: Error: connect ECONNREFUSED 169.254.172.2:3000
2023-11-19T14:55:00.233+00:00 at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1555:16) {
2023-11-19T14:55:00.233+00:00 errno: -111,
2023-11-19T14:55:00.233+00:00 code: 'ECONNREFUSED',
2023-11-19T14:55:00.233+00:00 syscall: 'connect',
2023-11-19T14:55:00.233+00:00 address: '169.254.172.2',
2023-11-19T14:55:00.233+00:00 port: 3000
2023-11-19T14:55:00.233+00:00 }
2023-11-19T14:55:00.233+00:00 }
And some environment variables that may help:
NEXTAUTH_URL = https://<public-dns-of-application-load-balancer>
AUTH_DOMAINS_WITH_SSO_ENFORCEMENT = <domain-of-emails>
AUTH_GOOGLE_CLIENT_ID = <REDACTED>
AUTH_GOOGLE_CLIENT_SECRET = <REDACTED>
SALT = <REDACTED>
NEXTAUTH_SECRET = <REDACTED>
DATABASE_URL = <REDACTED>
Port 3000 is being exposed to the load balancer, and we are accessing the load balancer on 443 - this is working fine.
It's just the errors in the log that are causing the task to constantly fail and be recreated.
Looking into the codebase, the most likely thing to be causing this is the telemetry. The container will not be reachable by its internal IP. Trying to disable this. If we disable it, how will this affect the functionality of the service?
Great! That does seem to have fixed it! The container has been running for 8 minutes now without issues! :rubber_duck: Would still like to know this though:
If we disable the telemetry, how will it affect the functionality of the service?
Ah, so the telemetry sends information about the usage (e.g. number of resources) to Posthog so the Langfuse team can look at how many people are using it. It's not essential for running Langfuse, but helps the maintainers.
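In practice it boils down to an env-gated usage ping. A rough TypeScript sketch of that pattern (illustrative only, not the actual Langfuse code; the TELEMETRY_ENABLED flag, API key name and counts below are assumptions, so double-check the self-hosting docs for your version):
```ts
// telemetry-sketch.ts - illustrative env-gated usage ping, not the real Langfuse implementation
import { PostHog } from "posthog-node";

// Telemetry is on unless explicitly disabled, so self-hosters can opt out via env.
const telemetryEnabled = process.env.TELEMETRY_ENABLED !== "false";

export async function sendUsagePing() {
  if (!telemetryEnabled) return; // opt-out: skip the network call entirely

  const client = new PostHog(process.env.POSTHOG_API_KEY ?? "", {
    host: "https://app.posthog.com",
  });

  // Only coarse, non-sensitive usage counts are reported.
  client.capture({
    distinctId: "self-hosted-instance",
    event: "usage_ping",
    properties: { projects: 3, traces: 120 }, // placeholder counts
  });

  await client.shutdown(); // flush queued events before returning
}
```
Disabling it just means this ping never goes out; nothing else in the app depends on it.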
Marc · 11mo ago
Yes. Thanks for pointing to this problem. Will fix this asap, as telemetry helps us understand rough usage patterns without capturing any sensitive information. Sorry for the inconvenience. This should not happen!
p5 · 11mo ago
No worries at all. If you let us know when there's a fix, we'd be happy to re-enable telemetry on our instance
Marc · 11mo ago
Fixing this today, will ping you here. Thanks!
p5 · 11mo ago
Thanks for fixing that! We've upgraded to v1.3.0 and re-enabled telemetry. Everything seems to be working again 🎉 (Side note: Love the database migration system btw. We don't even need to know something's changing)
Marc · 11mo ago
Just wanted to ping you here rn. Great that you already saw it (how?)
p5 · 11mo ago
I constantly check the GitHub dashboard, and it was the most recent post 🙂
Marc · 11mo ago
> (Side note: Love the database migration system btw. We don't even need to know something's changing)
Love it as well. Since we moved it to the docker entrypoint, it is also super easy for everyone to use (rough sketch below).
Haha, nice. I guess I need more screens to have this on one screen all the time as well.
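For reference, a minimal sketch of what that entrypoint step could look like, assuming it runs Prisma's migrate deploy (which the "No pending migrations to apply" lines in the startup log suggest). Illustrative only, not the literal script from the image:
```ts
// start.ts - sketch of a container entrypoint that applies pending migrations, then boots the app
import { execSync } from "node:child_process";

// Apply any pending Prisma migrations against DATABASE_URL before serving traffic.
// "prisma migrate deploy" is idempotent: it reports "No pending migrations to apply"
// when the schema is already current, as in the logs above.
execSync("npx prisma migrate deploy", { stdio: "inherit" });

// Hand off to the Next.js server once the schema is up to date.
execSync("npm run start", { stdio: "inherit" });
```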
p5 · 11mo ago
One for Slack, one for code and one for a newsfeed (that's YouTube 90% of the time 🤫). Could never go back to a single monitor.
Marc · 11mo ago
Adding this to my wishlist. But tbh, only a laptop gives tons of focus atm.
Thanks for raising this issue. The change was more complex than expected to make it work without race conditions under a lot of load, but I'm happy that we removed the cron process.
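(For anyone landing here with the same crash loop: one common pattern for this kind of change is to run the periodic job inside the existing server process and coordinate instances through a shared database lock, instead of a separate cron fetching the container's own HTTP endpoint, which is what produced the ECONNREFUSED above. A sketch of that pattern only, not necessarily what Langfuse actually shipped:)
```ts
// telemetry-job.ts - sketch of an in-process periodic job guarded by a Postgres
// advisory lock, so only one container sends the usage ping at a time.
import { Client } from "pg";

const LOCK_KEY = 424242; // arbitrary app-wide advisory lock id (assumption)

async function runOncePerCluster(task: () => Promise<void>) {
  const db = new Client({ connectionString: process.env.DATABASE_URL });
  await db.connect();
  try {
    // pg_try_advisory_lock returns false if another instance already holds the lock.
    const res = await db.query("SELECT pg_try_advisory_lock($1) AS locked", [LOCK_KEY]);
    if (res.rows[0].locked) {
      await task();
      await db.query("SELECT pg_advisory_unlock($1)", [LOCK_KEY]);
    }
  } finally {
    await db.end();
  }
}

// Runs inside the web server process; no self-HTTP call, no separate cron container.
setInterval(() => {
  runOncePerCluster(async () => {
    /* send the usage ping here */
  }).catch(console.error);
}, 60 * 60 * 1000); // hourly
```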