TrafficLab 3D: FAQs
- GitHub repo: duy-phamduc68/trafficlab-3d
- YouTube demo: TrafficLab 3D v1.0 Demo
- YouTube guide: TrafficLab 3D Guide
- Check out the full academic report on Google Drive
Prologue
This is it, this was the project that helped me figure out my “niche” in the field of AI recently. Lately, I’ve been very interested in digital twins and autonomous vehicles, which in turn translates to robotics (might be a long shot, but AVs are a type of robot). Now, about TrafficLab 3D (TL3D): as you can see above, I’ve already attached my academic report for the project, so if you really want to dive deep and learn about the technical side of TL3D, I’d suggest checking it out. With that being said, in this post I’m not going to talk much about TL3D itself, but rather the context surrounding it, what it meant for me then, and where it’s going, i.e. TL3D’s future roadmap. Notably, I’m going to detail TL3D’s current weaknesses, which relate directly to the next project I’m doing my best to work on.
Update: The above prologue was written a few days ago. Now I want to try something new with my posts: I’m going to write in more of an FAQ/interview style. I believe this is way better than dwelling on getting the most polished, formatted article ever for a given post, at least for this one, which I have already procrastinated on for a while. I believe this will be way more interesting and engaging to read, but it is an experiment, so let me know what you think in the comments; all suggestions are welcome.
Q1. Before TrafficLab3D, what was bothering you about the way people approach traffic / AV / CV problems?
I initially approached this problem with the literal worst-case scenario for this kind of task: Hanoi traffic (where I currently live). So let me make some comparisons to the studies I cited; the fact that I started with Hanoi (and eventually kind of failed to completely solve it) affected this project greatly.
- First, most papers relied on a near-perfect projection between the CCTV view and the satellite view, obtained either through fairly trivial matching (clear road markings) or an already-known camera model. Hanoi has neither of those. Not to mention, my data source was 720p, but its bit rate was actually terrible (something I realized quite late), which taught me that not just resolution but also bit rate and motion blur matter greatly for detection and tracking.
- Secondly, these papers often showcase demos in very standard, homogeneous traffic; my main reference even has the detector classify subclasses like sedan, hatchback, pickup truck, etc., while in Hanoi, yeah, good luck with that: 2 wheels, 3 wheels, 4 wheels, all sizes… I had to rethink how detection could work in a scenario like this (worth mentioning: I did a bit of tinkering with crowd counting models, more on that later).
- Finally, the sheer amount of traffic compared to Western contexts, especially motorbikes. All of these aspects meant I didn’t allow myself to just clone a research repo and reimplement it or something; I wanted a pipeline that was truly tailored to my context (Hanoi).
Q2. Why MP4 + Google Maps specifically?
I didn’t just want to do what was “proven” to work, like those fancy demos run on curated datasets and established testing sites, perfect cameras and all. I believe that,
- one, I wanted to solve my own problem, one that I believe they couldn’t solve,
- two, if I can solve the “worst case”, it implies solving everything else, a mindset I might have unintentionally inherited from my GOAT in AI, Demis Hassabis: “solve intelligence, then solve everything else”.
Q3. Is TL3D a tool, a demo system, or a statement?
I don’t think it fits cleanly into any category on that list; it’s a little bit of everything. But because of that, I’d say it’s more of a statement. It’s my entry into this niche, I think, one that I could do in my current circumstances (limited resources, no access to higher-quality data, etc.). Even if Western-context data were more accessible, even if others have proper evaluations unlike my so-called academic report that doesn’t even particularly measure anything, I still did it.
Q4. What does “digital twin” mean in your definition?
Yeah, I agree that this term is becoming more of a stunt, but, at least in my definition, I think it’s not just visualization, but an interactive, tractable, modifiable simulation.
For example, for my TL3D, before I myself can call it a digital twin, at the very least its road users have to be accurate, account for uncertainty, and perhaps be full derived agents instead of just states. That would allow for hypothesis testing, scenario generation, and high-fidelity simulation.
In this regard, I admit that using the term “digital twin” in this project was a bit overblown, hehe… but this is exactly where I’m heading beyond TL3D.
Q5. Where does TL3D “cheat”?
Yeah, this question sounds like it’s about weaknesses, so I’ll list them out for now, along with possible solution directions:
- manual calibration -> semi-supervised calibration,
- flat ground -> not sure yet, I have to solve flat before I can move to non-planar environments,
- ignoring uncertainty -> this will probably need heuristics combined with probabilistic observation. I plan to approach it not by listing every single uncertainty there is (e.g. a sedan hidden behind a truck…), but it is one of the main weaknesses that I have to solve for my next “vision” of TL3D.
Its biggest cheat is probably that it’s not exactly “3D”. It has 3D bounding boxes, yes, but those are very artificial… basically, if you already have a tracked 2D box and the projection, then you already have the 3D box, so I didn’t really innovate on top of some existing monocular 3D detection pipeline (cite some papers).
But then again, remember that I started with Hanoi. For such a non-formal and under-explored context, I approached this assuming that I just had to trust my gut on what would and wouldn’t work, rather than expecting some paper to come in and solve my problem (I read them thoroughly and decided they weren’t worth diving deep into in many ways; basically, you have to understand that sometimes these papers are only for… well, the paper).
As for me? I want to take this idea to city scale, and I have to put some thought into what would scale and what wouldn’t. One approach I rejected, at least one I won’t explore yet, is the monolithic “image in, 3D box out” style used in AVs.
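To make the “2D box + projection ≈ 3D box” shortcut above concrete, here is a minimal sketch. The homography `H`, the bbox format, and the per-class dimensions are illustrative assumptions on my part, not TL3D’s actual code:

```python
import numpy as np

def project_to_ground(H, px, py):
    """Map an image pixel to ground-plane coordinates via a 3x3 homography."""
    p = H @ np.array([px, py, 1.0])
    return p[:2] / p[2]

def pseudo_3d_box(H, bbox, dims):
    """Derive an axis-aligned 3D footprint from a tracked 2D box.

    bbox: (x1, y1, x2, y2) in pixels; dims: (length, width, height) in metres,
    assumed per vehicle class. The bottom-centre of the 2D box is taken as the
    ground contact point, which is exactly the "cheat" described in the text.
    """
    x1, y1, x2, y2 = bbox
    cx, cy = project_to_ground(H, (x1 + x2) / 2.0, y2)
    L, W, Hgt = dims
    return {
        "center": (cx, cy),
        "footprint": [(cx - L / 2, cy - W / 2), (cx + L / 2, cy - W / 2),
                      (cx + L / 2, cy + W / 2), (cx - L / 2, cy + W / 2)],
        "height": Hgt,
    }
```

Note there is no real 3D estimation anywhere here: all the “3D” comes from the projection plus assumed dimensions, which is why I call the boxes artificial.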
Q6. What breaks first if you scale this?
Truth is, I need a data source like that first before I can really answer this question.
But I’ll tell you this: I’m not really a fan of cameras everywhere, which is not really a technical opinion but a sociological one. If I had to say it for now, my “personally acceptable context” is 2 cameras per complex intersection, maybe 4 if it’s a massive scramble crossing like those in Tokyo.
Q7. What part of the system feels the most ‘fake’ to you right now?
Yeah, the speeds are nonsensical lol… it’s also a weakness of TL3D that it’s pretty much a vanilla detector plus tracker.
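For reference, a speed estimate in a pipeline like this usually falls out of ground-plane displacement over time. A minimal sketch (the windowed averaging and all names are my own illustrative choices, not TL3D’s actual code):

```python
import math

def ground_speed(track, fps, window=5):
    """Estimate speed (m/s) from a track's ground-plane positions.

    track: list of (x, y) positions in metres, one per frame. Averaging over
    a window of frames damps per-frame jitter, which is exactly the noise
    that makes naive frame-to-frame speeds look nonsensical.
    """
    if len(track) < window + 1:
        return None  # not enough history for a stable estimate
    (x0, y0), (x1, y1) = track[-window - 1], track[-1]
    dist = math.hypot(x1 - x0, y1 - y0)
    return dist * fps / window
```

Even this only helps with jitter; if the projection itself is off, the speeds stay wrong, which is the deeper problem.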
I do believe that for a very specific context like what I’m aiming at, a complex intersection with mixed traffic, there is a lot we can “leverage”. Just think about it: in a city-wide system, a car can’t just disappear in the middle of the road, nor can it just “pop in”. This touches one of my current philosophies about the project, though; if it helps, here’s an interaction from my YouTube demo:
@Luxtini (1 month ago)
Good project, I will try to try it but with pedestrian view videos. Do you plan to implement a counting function per line in the street corners?
@yuk.yukano (1 month ago, edited)
It’s definitely something that can be done, though I might not be too interested in downstream heuristic-based tasks like counting just yet, it can obviously be leveraged for that. My priority with TrafficLab right now is to improve its monocular accuracy, if you notice the detections degrade as objects are further away from the camera, and sometimes box orientation can be nonsensical, I’m trying to resolve those first. I believe that only when detection are robust should we move to further processing. Thank you for checking out my project!
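Circling back to the “a car can’t just disappear” idea: a toy sketch of what that constraint could look like as a track-coasting rule. The constant-velocity coast and the `max_missed` cutoff are my own assumptions, not anything implemented in TL3D:

```python
class CoastingTrack:
    """Keep a track alive through missed detections instead of killing it.

    A vehicle mid-road cannot vanish, so when the detector drops it we coast
    the last known velocity for up to `max_missed` frames before giving up.
    """

    def __init__(self, pos, vel, max_missed=10):
        self.pos, self.vel = pos, vel
        self.missed, self.max_missed = 0, max_missed
        self.alive = True

    def update(self, detection=None):
        if detection is not None:          # matched: trust the detector
            self.vel = (detection[0] - self.pos[0], detection[1] - self.pos[1])
            self.pos, self.missed = detection, 0
        else:                              # missed: coast on last velocity
            self.missed += 1
            if self.missed > self.max_missed:
                self.alive = False
                return
            self.pos = (self.pos[0] + self.vel[0], self.pos[1] + self.vel[1])
```

This is the kind of cheap physical prior I mean by “leverage”: the world constrains what the tracker is allowed to believe.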
Q8. What comes first: accurate world modeling or accurate agent behavior?
Haha, yeah, it kind of relates, which is why I think that before I can realize my final vision of this, I’ll need a bunch of sub-projects as well. One of them is a revised version of TL3D that generates a “common trajectory map” instead of immediately jumping to a digital twin.
Q9. Where does TL3D sit in that loop?
It’s world-first right now, I believe; it’s still very much a visualization layer more than anything else.
Q10. What did this project change about how you see AI?
A bunch of stuff, honestly:
- One, I’m not really that interested in LLMs and such anymore. I will still apply them where they fit (e.g. for my final vision, there could be a natural language interface: “give me the traffic volume of intersection 18”, “give me all near-gridlock locations today”, etc.), but for now, working on that LLM stuff, I’m just not in a position for it yet (in terms of both competition and compute limitations).
- Second, I realized that some research repos are abhorrent lol; my tinkering with crowd counting models was really painful… It made me realize there’s a reason why people do more “structured, predictable” work, but then, making their research actually useful for others is sometimes not their priority.
- Third, yes, “digitizing the world” is quite the new buzzword in my head; physically constrained AI, maybe.
Q11. Why this niche specifically?
Well, many things, but, for the first time, I feel like I’m not throwing shit at the wall anymore; I genuinely enjoyed myself throughout this project.
And perhaps more importantly, my GitHub repo, as of posting this, sits at around 165 stars and 20 forks, the most support I’ve ever received for a repo (I don’t usually even get stars). And yeah, after diving into this so deep and genuinely enjoying it, I don’t want to let it go; I want to dig deeper. And that final vision of mine? It’s not “I have to get to it first” or “I have to stay ahead of the curve”; it’s simply my potential “magnum opus”, if you will.
Q12. If you could remove ONE manual step in TL3D, what would it be?
Definitely calibration lol. Think about city scale: there could be thousands to tens of thousands of cameras. The other steps are more of a direct generalization research direction, but manual calibration is kind of pointless busywork, given that it’s trivial once you already have the camera model. So the sooner I can automate it or make it semi-supervised, and as effortlessly as possible (meaning I also don’t want to spend too much time coming up with the solution itself), the better; at least until I have access to such resources, which are usually government-locked.
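For context, the manual step in question boils down to estimating a homography from a handful of image-to-map point correspondences. A minimal DLT sketch follows; in practice one would likely reach for OpenCV’s `findHomography` with RANSAC, and the correspondences in the test are made up:

```python
import numpy as np

def fit_homography(img_pts, map_pts):
    """Estimate a 3x3 homography from >= 4 image->map correspondences (DLT)."""
    A = []
    for (x, y), (u, v) in zip(img_pts, map_pts):
        # Each correspondence contributes two linear constraints on H.
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # h is the null vector of A: the right singular vector with the
    # smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_h(H, pt):
    """Apply a homography to a 2D point (homogeneous divide included)."""
    p = H @ np.array([pt[0], pt[1], 1.0])
    return p[:2] / p[2]
```

“Semi-supervised” here would mean getting those correspondences with minimal clicking, e.g. proposing them automatically and only asking a human to confirm.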
Q13. What would the “version that actually matters” look like?
Yeah, I think you can imagine that vision already from everything I’ve said so far: basically TL3D but city-scale, in 3D lol. A good parallel might be https://demo.f4map.com/; just zoom into a popular city and you should see 3D buildings with (fake) 3D cars moving around the streets.
Q14. What are you trying to prove next?
I think… I want to breathe a bit of a new perspective into this field; not the kind where you hand-code agents into RL and simulation environments anymore, and not by hallucinating pixels with something like video models/world models.
Q15. Why does starting from Hanoi invalidate existing assumptions?
Yeah, looking at Hanoi traffic data for so long, without ever having driven in Hanoi, or anywhere for that matter, I still realized that traffic is not just a geometric or flow-engineering problem; sometimes it’s a sociological one. In Hanoi, the rules are there, but they often have to be broken to prevent the city from coming to a halt. Such behavior is dangerous, of course, but it’s kind of a culture already; in Vietnamese it can be considered “khôn lỏi” (roughly, sly street-smarts), but I believe it’s implicitly learned survival. I actually don’t know how this affects the technical side of things; it’s just something I thought about.
Q16. Re-evaluating the “worst-case solves everything” idea
Simply, I wanted to show something that’s “me”: how taking the unconventional path can take you to unexpected places. So instead of forking an “end-to-end” pipeline and calling it a day, I learned about all these things, wrote a nearly 100-page report, built an entire application GUI, etc.
Q17. Why reject monolithic pipelines?
Yeah, pretty much, but also adaptability. So far, vanilla 2D detection models seem to be best for that (YOLO, DETR, …); I imagine it wouldn’t be easy for a monolithic monocular 3D model to just “pick up”, say, three-wheelers, or maybe in the future even weird-looking self-driving shuttles and delivery bots. My assumption, of course; you can comment if they can do that and cite a study, but for now this is “my thinking”. I can’t do everything, and I actually already believe that my final vision, hell, TL3D itself, should have been a group effort. But in my current environment I only have myself, and I hope that my final vision can at least connect me to the people that matter.
Q18. What breaks conceptually at city scale?
I actually haven’t given that much thought yet; I’m just mentioning it as an anchor, and I don’t have that data yet anyway. But yeah, if I’m able to perfect TL3D, i.e. this idea for one intersection, then the city-scale version would present a very different problem set and application set, I believe.
Q19. On “not really 3D” and shifting standards
I think I kind of already answered this, but basically, my standards grew. This is quite personal: I often have this habit of constantly invalidating myself. A prime example is when I achieved my unexpected IELTS 8.0; I felt surprised and relieved for like 5 seconds, but then immediately lost the joy and just kind of felt numb. It’s probably not healthy, but I hope I can make it a healthy thing one day, without rejecting myself.
Q20. Expanding the “common trajectory map” idea
I’m thinking it should be a set of common, vectorized paths that agents can default to if their detections start to get noisy. I haven’t figured it all out; this is just on the roadmap for now.
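As a rough sketch of how such a trajectory map could be consumed, here is one way an agent might fall back onto a stored path; the polyline format and the snap threshold are entirely hypothetical:

```python
import math

def snap_to_path(point, paths, max_dist=2.0):
    """Snap a noisy ground-plane position to the nearest stored path.

    paths: list of polylines, each a list of (x, y) vertices in metres.
    Returns the closest point on any path, or the original point unchanged
    if every path is farther than max_dist (the detection is probably
    genuinely off-path and should not be forced onto one).
    """
    best, best_d = point, max_dist
    for path in paths:
        for (ax, ay), (bx, by) in zip(path, path[1:]):
            # Project the point onto segment a-b, clamped to the segment.
            dx, dy = bx - ax, by - ay
            seg2 = dx * dx + dy * dy
            t = 0.0 if seg2 == 0 else max(0.0, min(1.0,
                ((point[0] - ax) * dx + (point[1] - ay) * dy) / seg2))
            qx, qy = ax + t * dx, ay + t * dy
            d = math.hypot(point[0] - qx, point[1] - qy)
            if d < best_d:
                best, best_d = (qx, qy), d
    return best
```

The interesting part would be mining the paths themselves from accumulated tracks, which is exactly the “common trajectory map” sub-project.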
Epilogue
Basically, if I had to sum it up, the actual biggest weakness of TL3D right now is that it doesn’t account for uncertainty, whether that’s projection mismatch, detector and tracker noise, and so on. To achieve a high-fidelity, physically accurate digital twin, we have to be able to work under uncertainty.
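For what it’s worth, the textbook starting point for working under detector noise would be something like a constant-velocity Kalman filter over each agent’s state, where detector uncertainty shows up explicitly as a measurement covariance. A generic 1D sketch (not TL3D code; the noise values `q` and `r` are made up):

```python
import numpy as np

def kalman_cv_step(x, P, z, dt=1.0, q=0.1, r=1.0):
    """One predict+update step of a 1D constant-velocity Kalman filter.

    x: state [position, velocity]; P: 2x2 state covariance;
    z: noisy measured position; q/r: process/measurement noise (assumed).
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity motion model
    Hm = np.array([[1.0, 0.0]])             # we only observe position
    Q = q * np.eye(2)
    R = np.array([[r]])
    # Predict: propagate state and grow uncertainty.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update: blend in the noisy detection, weighted by the Kalman gain.
    S = Hm @ P @ Hm.T + R
    K = P @ Hm.T @ np.linalg.inv(S)
    x = x + K @ (np.array([z]) - Hm @ x)
    P = (np.eye(2) - K @ Hm) @ P
    return x, P
```

The appeal is that the filter’s covariance `P` is itself a measure of how much to trust each agent, which is exactly the kind of quantity a digital twin that respects uncertainty would need to expose.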
Anyway, that’s about it. Thanks for checking out my project; consider giving it a star on GitHub if it was interesting to you: duy-phamduc68/trafficlab-3d. Tell me what you think in the comments. Until next time :).