Inspiration

As computer science students who spend a lot of time applying for internships, we find it tedious to copy and paste the same information over and over. This is especially true when sites use different form formats that render browser autofill useless. So we decided that using hand signs to quickly toggle between these pieces of information would make for a meaningful and interesting project.

What it does

When the user first launches the app, it presents six input fields that the user can “preset” the app to remember. Once the six inputs are set, the user can launch the camera to start the application. Using a pre-trained computer vision model, we track the user’s hands and process the result to determine which number they are showing. From there, the user can show different hand signs to switch between the pieces of information they wish to paste.
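As a minimal sketch of the preset idea described above, the six saved inputs can be thought of as a mapping from the detected gesture number (0–5) to a saved string. The names and values below are illustrative placeholders, not the actual app's data:

```python
# Hypothetical preset table: each gesture count (0-5) maps to one of the
# six user-configured inputs. Values here are made-up examples.
presets = {
    0: "jane.doe@example.com",
    1: "Jane Doe",
    2: "+1 555 0100",
    3: "https://github.com/janedoe",
    4: "https://linkedin.com/in/janedoe",
    5: "123 Example Street",
}

def preset_for_gesture(count):
    """Return the saved text for a detected finger count, or None."""
    return presets.get(count)
```

In the real app, the returned string would then be copied to the system clipboard.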

How we built it

For hand detection, we used Google’s MediaPipe hand landmark detection model. The hand landmark model bundle performs keypoint localization of 21 hand-knuckle coordinates within the detected hand regions. From those landmarks, we selected indices 4, 8, 12, 16, and 20 (the fingertips, from thumb to pinky) to detect whether each finger is raised, and hence the exact number the user is holding up. While Google’s MediaPipe hand landmark model is pre-trained to detect hands, we had to implement the gesture detection ourselves.

The exact implementation is as follows: we take the coordinates detected by Google’s MediaPipe model, where each landmark is represented as an (x, y) pair, and compare them. The thumb is handled specially. For the right hand, if the x-coordinate of the thumb tip (landmark 4) is less than the x-coordinate of the next landmark down the thumb (landmark 3), we assume the thumb is extended; for the left hand, the comparison is reversed. The other fingers are simpler: we compare the y-coordinate of the fingertip landmark with that of the joint below it, and if the tip sits above that joint, the finger is extended. We then count the number of extended fingers and take that to be the number the user is showing.
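The counting step above can be sketched as plain Python, assuming `landmarks` is the list of 21 (x, y) pairs produced by the hand landmark model (indices follow MediaPipe’s numbering: 4 is the thumb tip, 8/12/16/20 are the other fingertips). The function name and exact structure are illustrative, not the project’s actual code:

```python
TIP_IDS = [4, 8, 12, 16, 20]   # fingertip landmarks, thumb -> pinky
PIP_IDS = [3, 6, 10, 14, 18]   # the joint below each fingertip

def count_extended_fingers(landmarks, handedness="Right"):
    """Count raised fingers from 21 (x, y) hand landmarks.

    Thumb: compared on the x-axis, direction depends on handedness.
    Other fingers: tip above (smaller y than) the joint below it = extended,
    since image y-coordinates grow downward.
    """
    count = 0

    # Thumb: x-axis test, mirrored for the left hand
    thumb_tip_x = landmarks[4][0]
    thumb_ip_x = landmarks[3][0]
    if handedness == "Right":
        if thumb_tip_x < thumb_ip_x:
            count += 1
    else:
        if thumb_tip_x > thumb_ip_x:
            count += 1

    # Index, middle, ring, pinky: y-axis test
    for tip_id, pip_id in zip(TIP_IDS[1:], PIP_IDS[1:]):
        if landmarks[tip_id][1] < landmarks[pip_id][1]:
            count += 1

    return count
```

This is why the result is capped at 0–5: each hand contributes at most five extended fingers, and a raised thumb-plus-index simply counts as two.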

However, this approach has a few limitations. For example, an extended thumb and index finger is a gesture that usually signals the number “7”, but because of how we implemented the gesture recognition, it is recognized as “2” instead: our implementation is limited to the numbers 0–5. Furthermore, users have to show the gesture with the palm facing the camera; with the back of the hand facing the camera, the model cannot detect the hand accurately. This is a limitation of Google’s MediaPipe model, so we couldn’t improve on it further.

Challenges we ran into

Initially, we wanted to make a VSCode extension but were met with many limitations due to our unfamiliarity with the VSCode extension development kit and the difficulty of getting the camera to communicate with the extension. We then transitioned to a web application: a React frontend took screenshots and sent them to a Flask backend, which processed the images and returned the detected number so the frontend could copy the corresponding input to the clipboard. However, we quickly found that this implementation was flawed because it required the webpage to be in focus. This was a big problem, since we wanted users to switch between clipboard items with a single hand gesture, across various applications and tabs, without constantly keeping that tab or webpage in focus. We then settled on our final implementation: a native Python program on the user's computer that behaves as a background application.

Accomplishments that we're proud of

We restarted the project three times: while the idea was there, we had to change our approach repeatedly. We still managed to produce a fully working product despite the two main difficulties that forced us to pivot on the final design, and we each got to experiment with areas of computing we hadn’t worked with before. Our second iteration, the full-stack web application, was completed at 9 pm before we realized it had the focus problem mentioned above. So we went back to square one and worked late into the night to finish the application.

What we learned

We used multiple libraries we hadn’t worked with before, specifically MediaPipe, OpenCV, pyperclip, and tkinter. We learned a lot about computer vision and its usefulness in creating cool and wacky applications. At the same time, we saw its limitations, especially in tracking a live video feed in real time. We also learned about the limitations of web applications, specifically that browsers throttle or suspend JavaScript execution in inactive tabs.

What's next for JutsuClip

For future development, we wish to turn this into a VSCode extension, as our initial idea was for JutsuClip to replace the mundane routine of repeated copying and pasting. We also want to add custom modifications and improvements on top of the computer vision model so that more complex hand gestures can produce a wider range of outputs.

Built With

Python, Flask, React, MediaPipe, OpenCV, pyperclip, tkinter
